dataset: [ADD] 50 Vietnamese datasets from vn-mteb #2964
KennethEnevoldsen merged 9 commits into embeddings-benchmark:main
Conversation
mteb/tasks/Classification/vie/AmazonCounterfactualVNClassification.py
```python
from mteb.abstasks.TaskMetadata import TaskMetadata

from ....abstasks import AbsTaskClassification, MultilingualTask
```
I think your tests are failing because you need to import from `mteb.abstasks` directly:
```diff
- from ....abstasks import AbsTaskClassification, MultilingualTask
+ from mteb.abstasks.AbsTaskClassification import AbsTaskClassification
+ from mteb.abstasks.TaskMetadata import TaskMetadata
```
Hi, happy to see the PR and congratulations on the release.
I know that the paper is already out, but I was a bit sad to see that you only use machine-translated datasets (though the verification pipeline does help a lot).
If you want to make a v2 of the benchmark, it might be ideal to use some of the native datasets in mteb; you can see that there are at least 26 available:
```python
import mteb

tasks = mteb.get_tasks(languages=["vie"])
tasks = [t for t in tasks if t.metadata.sample_creation != "machine-translated"]
len(tasks)  # 26
```
Can I also ask you to compute the metrics using:
```python
task = mteb.get_task(...)
task.calculate_metadata_metrics()
```
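For one of the tasks in this PR that could look roughly like the sketch below; the task name is taken from the file under review and is only illustrative, so repeat it for each new dataset:

```python
import mteb

# Load one of the newly added tasks and compute its descriptive statistics
# (task name taken from the file under review; adjust per dataset).
task = mteb.get_task("AmazonCounterfactualVNClassification")
task.calculate_metadata_metrics()
```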
```python
dialect=[],
sample_creation="machine-translated",
socioeconomic_status=None,
text_creation=None,
```
The test fails as many of the metadata fields are not specified. Do ask if there are questions on how to fill them out.
```python
annotations_creators="derived",
dialect=[],
sample_creation="machine-translated",
socioeconomic_status=None,
```
```diff
- socioeconomic_status=None,
```
No longer used
```python
class AmazonCounterfactualVNClassification(AbsTaskClassification):
    metadata = TaskMetadata(
```
```diff
  class AmazonCounterfactualVNClassification(AbsTaskClassification):
+     num_samples = 32
+     n_experiments = 10
      metadata = TaskMetadata(
```
I thought about this too, but n_experiments can't be passed like this.
But 10 is the default value, so it can be removed.
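A minimal sketch of what that leaves, assuming the attribute names from the suggestion above and that `n_experiments` defaults to 10 on the base class, as noted:

```python
from mteb.abstasks.AbsTaskClassification import AbsTaskClassification


class AmazonCounterfactualVNClassification(AbsTaskClassification):
    # Keep only the value that differs from the base-class default;
    # n_experiments = 10 is already the default, so it is omitted.
    num_samples = 32
    # metadata = TaskMetadata(...) unchanged, as in the rest of the file
```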
```python
    )

    @property
    def metadata_dict(self) -> dict[str, str]:
```
can be deleted (see comment above)
| "revision": "b48bc27d383cfca5b6a47135a52390fa5f66b253" | ||
| }, | ||
| description=( | ||
| "A collection of Amazon customer reviews annotated for counterfactual detection pair classification." |
Please also add a description of how it was machine-translated, and note that it was adapted from AmazonCounterfactualClassification.
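A hedged sketch of how the expanded description might read (wording is illustrative, not the final text):

```python
# Illustrative wording only; adjust to match the actual translation pipeline.
description = (
    "A collection of Amazon customer reviews annotated for counterfactual detection "
    "pair classification. Adapted from AmazonCounterfactualClassification: the English "
    "data was machine-translated into Vietnamese and verified with a language model."
)
```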
```python
eval_langs=["vie-Latn"],
main_score="accuracy",
date=("2025-07-29", "2025-07-30"),
form=None,
```
```diff
- form=None,
```
```python
license="cc-by-sa-4.0",
annotations_creators="derived",
dialect=[],
sample_creation="machine-translated",
```
I would make this "machine-translated and LM verified," given the pipeline. I would also describe the verification process in the description.
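That would make the field read as follows (value as suggested; sketch only):

```python
sample_creation = "machine-translated and LM verified"
```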
Hi @Samoed, can you have a quick check why on the
Currently, in the logs there is no error in the
Hi @KennethEnevoldsen, @Samoed, thanks for your constructive comments. I added new
KennethEnevoldsen left a comment
There are still comments that haven't yet been resolved. Please take another look at these.
I already updated the code based on the comments you gave. Please have a look.
Thanks! Great to have these merged.
"languages": ["dan-Latn"], } ] } # with update: res[0].get_score() # np.float64(0.02837) res[0].scores with_fix = { "train": [ { "ndcg_at_1": 0.02597, "ndcg_at_3": 0.02213, "ndcg_at_5": 0.0262, "ndcg_at_10": 0.02837, "ndcg_at_20": 0.04548, "ndcg_at_100": 0.13527, "ndcg_at_1000": 0.24507, "map_at_1": 0.00866, "map_at_3": 0.01317, "map_at_5": 0.0149, "map_at_10": 0.01562, "map_at_20": 0.01898, "map_at_100": 0.02968, "map_at_1000": 0.03841, "recall_at_1": 0.00866, "recall_at_3": 0.02056, "recall_at_5": 0.02922, "recall_at_10": 0.03355, "recall_at_20": 0.08268, "recall_at_100": 0.43766, "recall_at_1000": 1.0, "precision_at_1": 0.02597, "precision_at_3": 0.02165, "precision_at_5": 0.01818, "precision_at_10": 0.01039, "precision_at_20": 0.01234, "precision_at_100": 0.01481, "precision_at_1000": 0.0034, "mrr_at_1": 0.025974, "mrr_at_3": 0.041126, "mrr_at_5": 0.04632, "mrr_at_10": 0.048485, "mrr_at_20": 0.058356, "mrr_at_100": 0.070186, "mrr_at_1000": 0.071349, "nauc_ndcg_at_1_max": 0.33969, "nauc_ndcg_at_1_std": -0.202864, "nauc_ndcg_at_1_diff1": -0.127, "nauc_ndcg_at_3_max": 0.409376, "nauc_ndcg_at_3_std": -0.039352, "nauc_ndcg_at_3_diff1": -0.022816, "nauc_ndcg_at_5_max": 0.250499, "nauc_ndcg_at_5_std": -0.115263, "nauc_ndcg_at_5_diff1": -0.057017, "nauc_ndcg_at_10_max": 0.238696, "nauc_ndcg_at_10_std": -0.138396, "nauc_ndcg_at_10_diff1": -0.045287, "nauc_ndcg_at_20_max": 0.154456, "nauc_ndcg_at_20_std": -0.070635, "nauc_ndcg_at_20_diff1": 0.074499, "nauc_ndcg_at_100_max": -0.005753, "nauc_ndcg_at_100_std": -0.074738, "nauc_ndcg_at_100_diff1": -0.005851, "nauc_ndcg_at_1000_max": 0.109439, "nauc_ndcg_at_1000_std": -0.089797, "nauc_ndcg_at_1000_diff1": -0.021634, "nauc_map_at_1_max": 0.33969, "nauc_map_at_1_std": -0.202864, "nauc_map_at_1_diff1": -0.127, "nauc_map_at_3_max": 0.385244, "nauc_map_at_3_std": -0.080638, "nauc_map_at_3_diff1": -0.060991, "nauc_map_at_5_max": 0.294871, "nauc_map_at_5_std": -0.119069, "nauc_map_at_5_diff1": -0.06234, "nauc_map_at_10_max": 0.285698, "nauc_map_at_10_std": -0.132856, "nauc_map_at_10_diff1": -0.055015, "nauc_map_at_20_max": 0.236619, "nauc_map_at_20_std": -0.100673, "nauc_map_at_20_diff1": -0.002619, "nauc_map_at_100_max": 0.15345, "nauc_map_at_100_std": -0.138888, "nauc_map_at_100_diff1": -0.02257, "nauc_map_at_1000_max": 0.171402, "nauc_map_at_1000_std": -0.134644, "nauc_map_at_1000_diff1": -0.034477, "nauc_recall_at_1_max": 0.33969, "nauc_recall_at_1_std": -0.202864, "nauc_recall_at_1_diff1": -0.127, "nauc_recall_at_3_max": 0.375072, "nauc_recall_at_3_std": -0.009643, "nauc_recall_at_3_diff1": -0.089168, "nauc_recall_at_5_max": 0.147691, "nauc_recall_at_5_std": -0.128654, "nauc_recall_at_5_diff1": -0.084259, "nauc_recall_at_10_max": 0.141055, "nauc_recall_at_10_std": -0.165932, "nauc_recall_at_10_diff1": -0.060966, "nauc_recall_at_20_max": 0.043863, "nauc_recall_at_20_std": -0.028374, "nauc_recall_at_20_diff1": 0.157575, "nauc_recall_at_100_max": -0.157183, "nauc_recall_at_100_std": -0.019437, "nauc_recall_at_100_diff1": 0.013395, # "nauc_recall_at_1000_max": nan, # "nauc_recall_at_1000_std": nan, # "nauc_recall_at_1000_diff1": nan, "nauc_precision_at_1_max": 0.33969, "nauc_precision_at_1_std": -0.202864, "nauc_precision_at_1_diff1": -0.127, "nauc_precision_at_3_max": 0.406318, "nauc_precision_at_3_std": 0.007031, "nauc_precision_at_3_diff1": -0.034709, "nauc_precision_at_5_max": 0.178131, "nauc_precision_at_5_std": -0.112493, "nauc_precision_at_5_diff1": -0.045535, "nauc_precision_at_10_max": 0.167897, "nauc_precision_at_10_std": -0.150626, 
"nauc_precision_at_10_diff1": -0.027811, "nauc_precision_at_20_max": 0.081428, "nauc_precision_at_20_std": -0.042304, "nauc_precision_at_20_diff1": 0.17278, "nauc_precision_at_100_max": -0.150619, "nauc_precision_at_100_std": 0.016133, "nauc_precision_at_100_diff1": -0.065571, "nauc_precision_at_1000_max": -0.017244, "nauc_precision_at_1000_std": 0.046614, "nauc_precision_at_1000_diff1": -0.028258, "nauc_mrr_at_1_max": 0.33969, "nauc_mrr_at_1_std": -0.202864, "nauc_mrr_at_1_diff1": -0.127, "nauc_mrr_at_3_max": 0.409511, "nauc_mrr_at_3_std": -0.064671, "nauc_mrr_at_3_diff1": -0.01911, "nauc_mrr_at_5_max": 0.319584, "nauc_mrr_at_5_std": -0.103546, "nauc_mrr_at_5_diff1": -0.025109, "nauc_mrr_at_10_max": 0.309614, "nauc_mrr_at_10_std": -0.117564, "nauc_mrr_at_10_diff1": -0.019691, "nauc_mrr_at_20_max": 0.262976, "nauc_mrr_at_20_std": -0.092222, "nauc_mrr_at_20_diff1": 0.024507, "nauc_mrr_at_100_max": 0.256052, "nauc_mrr_at_100_std": -0.094249, "nauc_mrr_at_100_diff1": 0.012432, "nauc_mrr_at_1000_max": 0.260112, "nauc_mrr_at_1000_std": -0.098845, "nauc_mrr_at_1000_diff1": 0.009697, "main_score": 0.02837, "hf_subset": "default", "languages": ["dan-Latn"], } ] } # check with_fix == before_fix # True * restructure * format * relax pytrec versions * fix incorrect parsing * 1.38.44 Automatically generated by python-semantic-release * Correcting the JINA models with SentenceTransformerWrapper (#3071) * ci: Add stale workflow (#3066) * add stale workflow * add permissions * add bug label to bug issue template * revert bug issue and only look at more info needed issues * more accurate name * override default * fix: open_clip package validation (#3073) * 1.38.45 Automatically generated by python-semantic-release * fix: Update revision for qzhou models (#3069) * 1.38.46 Automatically generated by python-semantic-release * Fix the reference link for CoDi-Embedding-V1 (#3075) Fix reference link * fix: Add beta version of RTEB related benchmarks (#3048) * Add RTEB related benchmarks * Add RTEB related benchmarks * Correcting the task names in the RTEB benchmarks * Update mteb/leaderboard/benchmark_selector.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Adding the CURE dataset to RTEB benchmarks * Use the right language subset * Fix broken finance icon URL in RTEB benchmarks Replace broken libre-finance-dollar.svg with working libre-gui-price-tag.svg Validated all icon URLs and confirmed accessibility compliance * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * 1.38.47 Automatically generated by python-semantic-release * fix: run `ruff check` on all files during ci (#3086) * fix: run `ruff check` on all files during ci * format * 1.38.48 Automatically generated by python-semantic-release * Move dev to dependency groups (#3088) add dependency groups * fix: Improving validate_task_to_prompt_name logs and error messages (#3079) * Improving validate_task_to_prompt_name logs and error messages * linter fixes * Adding None prompts tests * Update test_benchmark_sentence_transformer * Update mteb/leaderboard/benchmark_selector.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> --------- Co-authored-by: Roman 
Solomatin <samoed.roman@gmail.com> * fix: duplicate mteb multilingual variables (#3080) * fix benchmark naming * format * lint * Update tasks & benchmarks tables * model: mdbr-leaf models (#3081) * added MDBR leaf models * fixed revision for mdbr-leaf-ir * added model prompts * updated training datasets * fixed linting * lotte task reference --------- Co-authored-by: Robin Vujanic <robin.vujanic@mongodb.com> * 1.38.49 Automatically generated by python-semantic-release * CI: Set upper limit for xdist version (#3098) * Commentout bibtex formatting * Remove `-n auto` * get back bibtex * try limiting versions * revert coverage * revert coverage --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Combine Plots and Tables into a Single (#3047) * feat - Combine Plots and Tables into a Single Tab #3009 * feat - Resize the plot to make it more readable * feat - Remove the (radar chart) * feat - Add a comment stating that it only shows the Top 5 models in the table. * feat - adjust layout * Update mteb/leaderboard/app.py * format --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com> Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * mteb importable * format * fix model implementations * fix `validate_task_to_prompt_name` * align regression task with others * remove model overview * remove partials * format * fix tests * fix evaluators tests * add trust remote code to bsard * pre-commit run all files * add all descriptive stats * fix trust remote code test * add `RetrievalSplitData` to reranking --------- Signed-off-by: admin <bo.wang@jina.ai> Co-authored-by: Mohammad Kalim Akram <kalim.akram@jina.ai> Co-authored-by: ItsukiFujii <42373615+ItsukiFujii@users.noreply.github.com> Co-authored-by: xinshuohu <xinshuohu@tencent.com> Co-authored-by: Xinshuo Hu <yanshek.woo@gmail.com> Co-authored-by: fzowl <160063452+fzowl@users.noreply.github.com> Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> Co-authored-by: Paul Teiletche <73120933+paultltc@users.noreply.github.com> Co-authored-by: github-actions <github-actions@github.com> Co-authored-by: Alexey Vatolin <vatolinalex@gmail.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: lsz05 <shengzhe.li@sbintuitions.co.jp> Co-authored-by: zhichao-aws <zhichaog@amazon.com> Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> Co-authored-by: Abdur-Rahman Butler <79828536+abdurrahmanbutler@users.noreply.github.com> Co-authored-by: Feiyang <feiyangc@google.com> Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com> Co-authored-by: semantic-release <semantic-release> Co-authored-by: Nikolay Banar <nikc20008@gmail.com> Co-authored-by: Penny Yu <51702222+PennyYu123@users.noreply.github.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: fzoll <5575946+fzoll@users.noreply.github.com> Co-authored-by: fzowl <zoltan@voyageai.com> Co-authored-by: Bao Loc Pham <67360122+BaoLocPham@users.noreply.github.com> Co-authored-by: Kritias <50093609+ElPlaguister@users.noreply.github.com> Co-authored-by: roipony <roipony@gmail.com> Co-authored-by: Aashka Trivedi <aashka.trivedi@gmail.com> Co-authored-by: Saba Sturua <45267439+jupyterjazz@users.noreply.github.com> 
Co-authored-by: admin <bo.wang@jina.ai>
Co-authored-by: Maximilian Werk <maximilian.werk@gmx.de>
Co-authored-by: Victor <zbwkeepgoing@126.com>
Co-authored-by: Yong woo Song <ywsong.dev@kakao.com>
Co-authored-by: Ryan Mullins <ryan@ryanmullins.org>
Co-authored-by: Robin Vujanic <robin-vjc@users.noreply.github.com>
Co-authored-by: Robin Vujanic <robin.vujanic@mongodb.com>
Co-authored-by: 笑尿伊人 <44760272+q275343119@users.noreply.github.com>
* model: add image support for jina embeddings v4 (#2893) * feat: unify text and image embeddings for all tasks * fix: uniform batch size * fix: update error message * fix: update code task * fix: update max length * fix: apply review suggestions * model: add kalm_models (kalm-emb-v2) ModelMeta (new PR) (#2889) * feat: add KaLM_Embedding_X_0605 in kalm_models * Update kalm_models.py for lint format * kalm-emb-v2 * kalm-emb-v2 * kalm-emb-v2 * kalm-emb-v2 * kalm-emb-v2 --------- Co-authored-by: xinshuohu <xinshuohu@tencent.com> Co-authored-by: Xinshuo Hu <yanshek.woo@gmail.com> * Add Classification Evaluator unit test (#2838) * Adding Classification Evaluator test * Modifications due to the comments * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Modifications due to the comments * Modifications due to the comments --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * fix: update colpali engine models (#2905) * adding vidore benchmarks * fix typo * clean vidore names + per lang eval * lint * vidore names * bibtex fix * fix revision * vidore v2 citation * update citation format and fix per-language mappings * lint: citations * typo citations * fix revisiions * lint * fix colnomic3b revision * fix colqwen2.5 revision + latest repo version * fix query agmentation tokens * colsmol revision * 1.38.35 Automatically generated by python-semantic-release * Evaluator tests (#2910) * Adding Classification Evaluator test * Modifications due to the comments * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Modifications due to the comments * Modifications due to the comments * Adding STSEvaluator and SummarizationEvaluator tests * Correcting due to the comments * Correcting due to the comments --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Classification dataset cleaning (#2900) * Classification dataset cleaning * Update pull request number * Fix metadata test * fix formatting * add script for cleaning * Update tasks & benchmarks tables * dataset: Add JapaneseSentimentClassification (#2913) Add JapaneseSentimentClassification * Update tasks & benchmarks tables * fix: change `passage` prompt to `document` (#2912) * change document to passage * fix prompt names * fix kwargs check * fix default prompt * 1.38.36 Automatically generated by python-semantic-release * model: Add OpenSearch inf-free sparse encoding models (#2903) add opensearch inf-free models Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * dataset: add BarExamQA dataset (#2916) * Add BareExamQA retrieval task * ran linter * updated details * updated details * fixed subtype name * fixed changes * ran linter again * Use `mteb.get_model` in adding_a_dataset.md (#2922) Update adding_a_dataset.md * fix: specify revision for opensearch (#2919) specify revision for opensearch * 1.38.37 Automatically generated by python-semantic-release * Update the link for gemini-embedding-001 (#2928) * fix: replace with passage (#2934) * fix: Only import SparseEncoder once sentence-transformer version have been checked (#2940) * fix: Only import SparseEncoder once sentence-transformer version have been checked fixes #2936 * Update 
mteb/models/opensearch_neural_sparse_models.py Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * fix: Prevent incorrectly passing "selector_state" to `get_benchmark` (#2939) The leaderboard would have (silent) errors where `get_benchmark` lead to a KeyError due to "selector_state" being passed as a default value. Setting `DEFAULT_BENCMARK_NAME` as the value solves this issue. * docs: Update adding_a_dataset.md (#2947) * docs: Update adding_a_dataset.md * Update docs/adding_a_dataset.md * ci: bump semantic release * 1.38.38 Automatically generated by python-semantic-release * dataset: Add BSARD v2, fixing the data loading issues of v1 (#2935) * BSARD loader fixed * BSARDv2 metadata fixed * Update mteb/tasks/Retrieval/fra/BSARDRetrieval.py --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update tasks & benchmarks tables * dataset: add GovReport dataset (#2953) * Added govreport task * Updated description * dataset: add BillSum datasets (#2943) * Added BillSum datasets * fixed billsumca * Updated BillSumCA description * Updated BillSumUS description * Update mteb/tasks/Retrieval/eng/BillSumCA.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update mteb/tasks/Retrieval/eng/BillSumUS.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * lint * lint --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Update tasks & benchmarks tables * fix: Add new benchmark beRuSciBench along with AbsTaskTextRegression (#2716) * Add RuSciBench * fix bitext mining lang * Add regression task * fix init * add missing files * Improve description * Add superseded_by * fix lint * Update regression task to match with v2 * Add stratified_subsampling for regression task * Add boostrap for regression task * Rename task class, add model as evaluator argument * fix import * fix import 2 * fixes * fix * Rename regression model protocol * Update tasks & benchmarks tables * 1.38.39 Automatically generated by python-semantic-release * qzhou-embedding model_meta & implementation (#2975) * qzhou-embedding model_meta & implementation * Update qzhou_models.py * Update qzhou_models.py Processing todo items(Add default instruction) * Update qzhou_models.py correct bge datalist * Update qzhou_models.py correct 'public_training_data' * Update qzhou_models.py * Update qzhou_models.py * Update qzhou_models.py * Update qzhou_models.py * Update mteb/models/qzhou_models.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Update mteb/models/qzhou_models.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * format qzhou_models.py for ruff check --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * model: Add Voyage 3.5 model configuration (#3005) Add Voyage 3.5 model configuration - Add voyage_3_5 ModelMeta with 1024 embed dimensions and 32000 max tokens - Set release date to 2025-01-21 with revision 1 - Configure for cosine similarity with instruction support - Include standard Voyage training datasets reference 🤖 Generated with [Claude Code](https://claude.ai/code) Co-authored-by: Claude <noreply@anthropic.com> * model: BAAI/bge-m3-unsupervised Model (#3007) * Add BAAI/bge-m3-unsupervised Model (BAAI/bge_m3_retromae is commented out - the details are proper, but it fails during loading the model for me, so i commented out) * Remove the commented retromae model --------- 
Co-authored-by: fzowl <zoltan@voyageai.com> * lint: Correcting lint errors (#3004) * Adding Classification Evaluator test * Modifications due to the comments * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update tests/test_evaluators/test_ClassificationEvaluator.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Modifications due to the comments * Modifications due to the comments * Correcting the lint errors --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * dataset: Added 50 Vietnamese dataset from vn-mteb (#2964) * [ADD] 50 vietnamese dataset from vn-mteb * [UPDATE] task metadata * [UPDATE] import dependencies * [UPDATE] task metadata, bibtext citation * [UPDATE-TEST] test_model_meta * [UPDATE] sample_creation to machine-translated and LM verified * [ADD] sample creation machine-translated and LM verified * [REMOVE] default fields metadata in Classfication tasks * Update tasks & benchmarks tables * model: Add Cohere embed-v4.0 model support (#3006) * Add Cohere embed-v4.0 model support - Add text-only embed-v4.0 model in cohere_models.py - Add multimodal embed-v4.0 model in cohere_v.py - Support configurable dimensions (256, 512, 1024, 1536) - Support 128,000 token context length - Support multimodal embedding (text, images, mixed PDFs) 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> * Add Cohere embed-v4.0 model support Update cohere_v.py and cohere_models.py to include the new embed-v4.0 model with proper configuration and integration. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com> * Add OpenAI models with 512 dimension (#3008) * Add OpenAI/text-embedding-3-small (512 dim) Add OpenAI/text-embedding-3-large (512 dim) * Correcting due to comments --------- Co-authored-by: fzowl <zoltan@voyageai.com> * Standardise task names and fix citation formatting (#3026) fixes for name formatting * Update tasks & benchmarks tables * fix: Add missing training sets for qzhou (#3023) * Supplement missing training sets * reformat code * Reorganize the data list format * update qzhou_model meta * 1.38.40 Automatically generated by python-semantic-release * model: Add samilpwc_models meta (#3028) * model: Add samilpwc_models meta * Fix: Remove CONST * Fix: Reformat File * Update: model revision * model: Add granite-vision-embedding model (#3029) * Add files via upload * Address review comments * Address review comments * ruff format * Update mteb/models/granite_vision_embedding_models.py * lint error fix --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * fix: incorrect revision for SNLRetrieval (#3033) The provided revisions doesn't seem to be present on: adrlau/navjordj-SNL_summarization_copy Replacing with latest revision * dataset: Add HumanEvalRetrieval task (#3022) * Add HumanEvalRetrieval dataset * Fix TaskMetadata structure and remove descriptive_stats - Use TaskMetadata class instead of dict - Remove descriptive_stats as requested in PR review - Add date field and proper import structure * Fix dataset path and use verified metadata - Change path from zeroshot/humaneval-embedding-benchmark to embedding-benchmark/HumanEval - Use actual description from HuggingFace dataset page - Remove fabricated citation and reference - Remove revision field that was incorrect - Reference HuggingFace dataset page instead of arxiv * Add 
correct revision hash to HumanEval - Add revision hash: ed1f48a for reproducibility * Fix HumanEval metadata validation - Add date field for metadata completeness - Add bibtex_citation field (empty string) - Required for TaskMetadata validation to pass - Should resolve PR test failure * Address reviewer feedback - Remove trust_remote_code parameter as requested - Add revision parameter to load_dataset() calls for consistency - Use metadata revision hash in dataset loading for reproducibility * Fix field names in HumanEval dataset loading Changed query_id/corpus_id to query-id/corpus-id to match actual dataset format. * Fix deprecated metadata_dict usage Use self.metadata.dataset instead of self.metadata_dict for v2.0 compatibility. * Fix data structure for MTEB compatibility - Organize data by splits as expected by MTEB retrieval tasks - Convert scores to integers for pytrec_eval compatibility * Address PR feedback for HumanEval dataset - Add descriptive statistics using calculate_metadata_metrics() - Enhance metadata description with dataset structure details - Add complete BibTeX citation for original paper - Update to full commit hash revision - Add python-Code language tag for programming language - Explain retrieval task formulation clearly * Fix BibTeX citation formatting for HumanEvalRetrieval - Update citation to match bibtexparser formatting requirements - Fields now in alphabetical order with lowercase names - Proper trailing commas and indentation * Update tasks & benchmarks tables * 1.38.41 Automatically generated by python-semantic-release * ci: reduce parallel runs for when checking if a dataset exists (#3035) The hope is that this will prevent many of the current [errors](https://github.com/embeddings-benchmark/mteb/actions/runs/17019125199/job/48245690831) * ci: Updating rerun delays to prevent false positives errors * ci: Updating rerun delays to prevent false positives errors * model: Add GreenNode Vietnamese Embedding models (#2994) * [ADD] 50 vietnamese dataset from vn-mteb * [UPDATE] task metadata * [UPDATE] import dependencies * [UPDATE] task metadata, bibtext citation * [UPDATE-TEST] test_model_meta * [UPDATE] sample_creation to machine-translated and LM verified * [ADD] sample creation machine-translated and LM verified * [ADD] Vietnamese Embedding models * [REMOVE] default fields metadata in Classfication tasks * [UPDATE] model to vi-vn language specific file * [FIX] lint * [FIX] model loader * model: add granite-embedding-english R2 models (#3050) * fix: Updated revision for jina-embeddings-v4 (#3046) * fix: jinav4 revision Signed-off-by: admin <bo.wang@jina.ai> * change revision instead of removing it Signed-off-by: admin <bo.wang@jina.ai> --------- Signed-off-by: admin <bo.wang@jina.ai> Co-authored-by: admin <bo.wang@jina.ai> * 1.38.42 Automatically generated by python-semantic-release * Fix 3 VN-MTEB Pair Classification tasks (#3053) * [ADD] 50 vietnamese dataset from vn-mteb * [UPDATE] task metadata * [UPDATE] import dependencies * [UPDATE] task metadata, bibtext citation * [UPDATE-TEST] test_model_meta * [UPDATE] sample_creation to machine-translated and LM verified * [ADD] sample creation machine-translated and LM verified * [ADD] Vietnamese Embedding models * [REMOVE] default fields metadata in Classfication tasks * [UPDATE] model to vi-vn language specific file * [FIX] lint * [FIX] model loader * [FIX] VN-MTEB 3 datasets PairClassification rename column * dataset: Add mbpp retrieval (#3037) * Add MBPP retrieval task - Code retrieval 
task based on 378 Python programming problems - Natural language queries matched to Python code implementations - Uses python-Code evaluation language for code-specific metrics - Includes proper citations and descriptive statistics * Add MBPPRetrieval to imports * Add descriptive statistics for MBPPRetrieval * Reformatting * Reformatting * Update tasks & benchmarks tables * dataset: Added wikisql retrieval (#3039) * Add WikiSQL retrieval task - Code retrieval task based on WikiSQL natural language to SQL dataset - Natural language questions matched to SQL query implementations - Uses sql-Code evaluation language for SQL-specific metrics - Includes proper citations and descriptive statistics * Add WikiSQLRetrieval to imports * Add descriptive statistics for WikiSQLRetrieval * Reformatting * Reformatting * Reformatting, correcting the revision * Update tasks & benchmarks tables * ci: Temporarily limit pytrec version to "pytrec-eval-terrier>=0.5.6, <0.5.8" to prevent errors try to fix CI * fix MBPPRetrieval revision (#3055) Update MBPPRetrieval.py Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> * fix: Add VN-MTEB benchmark and Leaderboard (#2995) * [ADD] 50 vietnamese dataset from vn-mteb * [UPDATE] task metadata * [UPDATE] import dependencies * [UPDATE] task metadata, bibtext citation * [UPDATE-TEST] test_model_meta * [UPDATE] sample_creation to machine-translated and LM verified * [ADD] sample creation machine-translated and LM verified * [ADD] VN-MTEB benchmark and leaderboard * [FIX] wrong benchmark name * [REMOVE] default fields metadata in Classfication tasks * Update tasks & benchmarks tables * 1.38.43 Automatically generated by python-semantic-release * Add hc3finance retrieval (#3041) * Add HC3Finance retrieval task - Financial retrieval task based on HC3 Finance dataset - Financial questions matched to human and AI-generated content - Covers financial explanations, analysis, and educational content - Includes proper citations and descriptive statistics * Add HC3FinanceRetrieval to imports * Add descriptive statistics for HC3FinanceRetrieval * Reformatting * Reformatting, correcting the revision * Update mteb/tasks/Retrieval/eng/HC3FinanceRetrieval.py --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Add finqa retrieval (#3042) * Add FinQA retrieval task - Financial numerical reasoning retrieval task based on FinQA dataset - Numerical financial questions matched to relevant document data - Covers earnings reports with tables and quantitative financial data - Includes proper citations and descriptive statistics * Add FinQARetrieval to imports * Add descriptive statistics for FinQARetrieval * Reformatting * Reformatting * Update mteb/tasks/Retrieval/eng/FinQARetrieval.py --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Update tasks & benchmarks tables * Add FinanceBenchRetrieval task (#3044) * Add FinanceBenchRetrieval * Update mteb/tasks/Retrieval/eng/FinanceBenchRetrieval.py --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Update tasks & benchmarks tables * Add FreshStackRetrieval task (#3043) * Add FreshStackRetrieval * Reformatting, correcting the revision * Dataset correction * Update tasks & benchmarks tables * dataset: Add ds1000 retrieval (#3038) * Add DS1000 retrieval task - Code retrieval task based on 1,000 data science programming problems - Natural language queries matched to Python data science code - Uses python-Code evaluation language for code-specific metrics - Covers 
pandas, numpy, matplotlib, scikit-learn, and scipy libraries * Add DS1000Retrieval to imports * Add descriptive statistics for DS1000Retrieval * Reformatting * Reformatting * Update tasks & benchmarks tables * Add ChatDoctorRetrieval (#3045) * Add ChatDoctorRetrieval * Reformatting, correcting the revision * Correct the dataset citation * Correcting due to comments * Update tasks & benchmarks tables * Correcting the (new) DS1000 dataset's revision (#3063) * Add DS1000 retrieval task - Code retrieval task based on 1,000 data science programming problems - Natural language queries matched to Python data science code - Uses python-Code evaluation language for code-specific metrics - Covers pandas, numpy, matplotlib, scikit-learn, and scipy libraries * Add DS1000Retrieval to imports * Add descriptive statistics for DS1000Retrieval * Reformatting * Reformatting * Add DS1000Retrieval task implementation * dataset: Add JinaVDR (#2942) * feat: added jinavdr benchmark * feat: added description for jinavdr * feat: fixed licenses and added bibtex * feat: made jinav4 compatible with vidore benchmark * feat: corrected query numbers * feat: removed print * feat: added max pixel argument for jina models * feat: score calculation on cpu * feat: adjust jina model for new mteb code * feat: code cleanup * feat: corrected bibtex * feat: make colpali run with jinavdr * feat: fixed comments * feat: better reference and fixed comments * feat: added date for tasks * feat: fixed missing metadata and bibtex * feat: added descriptions per dataset * Update tasks & benchmarks tables * model: Add CoDi-Embedding-V1 (#3054) * add codiemb-minicpm * replace codiemb_minicpm with codi_model * Update mteb/models/codi_model.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Update mteb/models/codi_model.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Update mteb/models/codi_model.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * update code * update code * reformat --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * fix: ensure that there are always relevant docs attached to query (#3058) * fix: ensure that there are always relevant docs attached to query Here is brief test that it doesn't influence scores: ```py t1 = mteb.get_task("TwitterHjerneRetrieval") meta = mteb.get_model_meta("minishlab/potion-base-2M") eval = mteb.MTEB(tasks=[t1]) res = eval.run(model=meta.load_model()) # before fix: res[0].get_score() # np.float64(0.02837) res[0].scores before_fix = { "train": [ { "ndcg_at_1": 0.02597, "ndcg_at_3": 0.02213, "ndcg_at_5": 0.0262, "ndcg_at_10": 0.02837, "ndcg_at_20": 0.04548, "ndcg_at_100": 0.13527, "ndcg_at_1000": 0.24507, "map_at_1": 0.00866, "map_at_3": 0.01317, "map_at_5": 0.0149, "map_at_10": 0.01562, "map_at_20": 0.01898, "map_at_100": 0.02968, "map_at_1000": 0.03841, "recall_at_1": 0.00866, "recall_at_3": 0.02056, "recall_at_5": 0.02922, "recall_at_10": 0.03355, "recall_at_20": 0.08268, "recall_at_100": 0.43766, "recall_at_1000": 1.0, "precision_at_1": 0.02597, "precision_at_3": 0.02165, "precision_at_5": 0.01818, "precision_at_10": 0.01039, "precision_at_20": 0.01234, "precision_at_100": 0.01481, "precision_at_1000": 0.0034, "mrr_at_1": 0.025974, "mrr_at_3": 0.041126, "mrr_at_5": 0.04632, "mrr_at_10": 0.048485, "mrr_at_20": 0.058356, "mrr_at_100": 0.070186, "mrr_at_1000": 0.071349, "nauc_ndcg_at_1_max": 0.33969, "nauc_ndcg_at_1_std": -0.202864, "nauc_ndcg_at_1_diff1": -0.127, "nauc_ndcg_at_3_max": 0.409376, "nauc_ndcg_at_3_std": -0.039352, 
"nauc_ndcg_at_3_diff1": -0.022816, "nauc_ndcg_at_5_max": 0.250499, "nauc_ndcg_at_5_std": -0.115263, "nauc_ndcg_at_5_diff1": -0.057017, "nauc_ndcg_at_10_max": 0.238696, "nauc_ndcg_at_10_std": -0.138396, "nauc_ndcg_at_10_diff1": -0.045287, "nauc_ndcg_at_20_max": 0.154456, "nauc_ndcg_at_20_std": -0.070635, "nauc_ndcg_at_20_diff1": 0.074499, "nauc_ndcg_at_100_max": -0.005753, "nauc_ndcg_at_100_std": -0.074738, "nauc_ndcg_at_100_diff1": -0.005851, "nauc_ndcg_at_1000_max": 0.109439, "nauc_ndcg_at_1000_std": -0.089797, "nauc_ndcg_at_1000_diff1": -0.021634, "nauc_map_at_1_max": 0.33969, "nauc_map_at_1_std": -0.202864, "nauc_map_at_1_diff1": -0.127, "nauc_map_at_3_max": 0.385244, "nauc_map_at_3_std": -0.080638, "nauc_map_at_3_diff1": -0.060991, "nauc_map_at_5_max": 0.294871, "nauc_map_at_5_std": -0.119069, "nauc_map_at_5_diff1": -0.06234, "nauc_map_at_10_max": 0.285698, "nauc_map_at_10_std": -0.132856, "nauc_map_at_10_diff1": -0.055015, "nauc_map_at_20_max": 0.236619, "nauc_map_at_20_std": -0.100673, "nauc_map_at_20_diff1": -0.002619, "nauc_map_at_100_max": 0.15345, "nauc_map_at_100_std": -0.138888, "nauc_map_at_100_diff1": -0.02257, "nauc_map_at_1000_max": 0.171402, "nauc_map_at_1000_std": -0.134644, "nauc_map_at_1000_diff1": -0.034477, "nauc_recall_at_1_max": 0.33969, "nauc_recall_at_1_std": -0.202864, "nauc_recall_at_1_diff1": -0.127, "nauc_recall_at_3_max": 0.375072, "nauc_recall_at_3_std": -0.009643, "nauc_recall_at_3_diff1": -0.089168, "nauc_recall_at_5_max": 0.147691, "nauc_recall_at_5_std": -0.128654, "nauc_recall_at_5_diff1": -0.084259, "nauc_recall_at_10_max": 0.141055, "nauc_recall_at_10_std": -0.165932, "nauc_recall_at_10_diff1": -0.060966, "nauc_recall_at_20_max": 0.043863, "nauc_recall_at_20_std": -0.028374, "nauc_recall_at_20_diff1": 0.157575, "nauc_recall_at_100_max": -0.157183, "nauc_recall_at_100_std": -0.019437, "nauc_recall_at_100_diff1": 0.013395, # "nauc_recall_at_1000_max": nan, # "nauc_recall_at_1000_std": nan, # "nauc_recall_at_1000_diff1": nan, "nauc_precision_at_1_max": 0.33969, "nauc_precision_at_1_std": -0.202864, "nauc_precision_at_1_diff1": -0.127, "nauc_precision_at_3_max": 0.406318, "nauc_precision_at_3_std": 0.007031, "nauc_precision_at_3_diff1": -0.034709, "nauc_precision_at_5_max": 0.178131, "nauc_precision_at_5_std": -0.112493, "nauc_precision_at_5_diff1": -0.045535, "nauc_precision_at_10_max": 0.167897, "nauc_precision_at_10_std": -0.150626, "nauc_precision_at_10_diff1": -0.027811, "nauc_precision_at_20_max": 0.081428, "nauc_precision_at_20_std": -0.042304, "nauc_precision_at_20_diff1": 0.17278, "nauc_precision_at_100_max": -0.150619, "nauc_precision_at_100_std": 0.016133, "nauc_precision_at_100_diff1": -0.065571, "nauc_precision_at_1000_max": -0.017244, "nauc_precision_at_1000_std": 0.046614, "nauc_precision_at_1000_diff1": -0.028258, "nauc_mrr_at_1_max": 0.33969, "nauc_mrr_at_1_std": -0.202864, "nauc_mrr_at_1_diff1": -0.127, "nauc_mrr_at_3_max": 0.409511, "nauc_mrr_at_3_std": -0.064671, "nauc_mrr_at_3_diff1": -0.01911, "nauc_mrr_at_5_max": 0.319584, "nauc_mrr_at_5_std": -0.103546, "nauc_mrr_at_5_diff1": -0.025109, "nauc_mrr_at_10_max": 0.309614, "nauc_mrr_at_10_std": -0.117564, "nauc_mrr_at_10_diff1": -0.019691, "nauc_mrr_at_20_max": 0.262976, "nauc_mrr_at_20_std": -0.092222, "nauc_mrr_at_20_diff1": 0.024507, "nauc_mrr_at_100_max": 0.256052, "nauc_mrr_at_100_std": -0.094249, "nauc_mrr_at_100_diff1": 0.012432, "nauc_mrr_at_1000_max": 0.260112, "nauc_mrr_at_1000_std": -0.098845, "nauc_mrr_at_1000_diff1": 0.009697, "main_score": 0.02837, "hf_subset": "default", 
"languages": ["dan-Latn"], } ] } # with update: res[0].get_score() # np.float64(0.02837) res[0].scores with_fix = { "train": [ { "ndcg_at_1": 0.02597, "ndcg_at_3": 0.02213, "ndcg_at_5": 0.0262, "ndcg_at_10": 0.02837, "ndcg_at_20": 0.04548, "ndcg_at_100": 0.13527, "ndcg_at_1000": 0.24507, "map_at_1": 0.00866, "map_at_3": 0.01317, "map_at_5": 0.0149, "map_at_10": 0.01562, "map_at_20": 0.01898, "map_at_100": 0.02968, "map_at_1000": 0.03841, "recall_at_1": 0.00866, "recall_at_3": 0.02056, "recall_at_5": 0.02922, "recall_at_10": 0.03355, "recall_at_20": 0.08268, "recall_at_100": 0.43766, "recall_at_1000": 1.0, "precision_at_1": 0.02597, "precision_at_3": 0.02165, "precision_at_5": 0.01818, "precision_at_10": 0.01039, "precision_at_20": 0.01234, "precision_at_100": 0.01481, "precision_at_1000": 0.0034, "mrr_at_1": 0.025974, "mrr_at_3": 0.041126, "mrr_at_5": 0.04632, "mrr_at_10": 0.048485, "mrr_at_20": 0.058356, "mrr_at_100": 0.070186, "mrr_at_1000": 0.071349, "nauc_ndcg_at_1_max": 0.33969, "nauc_ndcg_at_1_std": -0.202864, "nauc_ndcg_at_1_diff1": -0.127, "nauc_ndcg_at_3_max": 0.409376, "nauc_ndcg_at_3_std": -0.039352, "nauc_ndcg_at_3_diff1": -0.022816, "nauc_ndcg_at_5_max": 0.250499, "nauc_ndcg_at_5_std": -0.115263, "nauc_ndcg_at_5_diff1": -0.057017, "nauc_ndcg_at_10_max": 0.238696, "nauc_ndcg_at_10_std": -0.138396, "nauc_ndcg_at_10_diff1": -0.045287, "nauc_ndcg_at_20_max": 0.154456, "nauc_ndcg_at_20_std": -0.070635, "nauc_ndcg_at_20_diff1": 0.074499, "nauc_ndcg_at_100_max": -0.005753, "nauc_ndcg_at_100_std": -0.074738, "nauc_ndcg_at_100_diff1": -0.005851, "nauc_ndcg_at_1000_max": 0.109439, "nauc_ndcg_at_1000_std": -0.089797, "nauc_ndcg_at_1000_diff1": -0.021634, "nauc_map_at_1_max": 0.33969, "nauc_map_at_1_std": -0.202864, "nauc_map_at_1_diff1": -0.127, "nauc_map_at_3_max": 0.385244, "nauc_map_at_3_std": -0.080638, "nauc_map_at_3_diff1": -0.060991, "nauc_map_at_5_max": 0.294871, "nauc_map_at_5_std": -0.119069, "nauc_map_at_5_diff1": -0.06234, "nauc_map_at_10_max": 0.285698, "nauc_map_at_10_std": -0.132856, "nauc_map_at_10_diff1": -0.055015, "nauc_map_at_20_max": 0.236619, "nauc_map_at_20_std": -0.100673, "nauc_map_at_20_diff1": -0.002619, "nauc_map_at_100_max": 0.15345, "nauc_map_at_100_std": -0.138888, "nauc_map_at_100_diff1": -0.02257, "nauc_map_at_1000_max": 0.171402, "nauc_map_at_1000_std": -0.134644, "nauc_map_at_1000_diff1": -0.034477, "nauc_recall_at_1_max": 0.33969, "nauc_recall_at_1_std": -0.202864, "nauc_recall_at_1_diff1": -0.127, "nauc_recall_at_3_max": 0.375072, "nauc_recall_at_3_std": -0.009643, "nauc_recall_at_3_diff1": -0.089168, "nauc_recall_at_5_max": 0.147691, "nauc_recall_at_5_std": -0.128654, "nauc_recall_at_5_diff1": -0.084259, "nauc_recall_at_10_max": 0.141055, "nauc_recall_at_10_std": -0.165932, "nauc_recall_at_10_diff1": -0.060966, "nauc_recall_at_20_max": 0.043863, "nauc_recall_at_20_std": -0.028374, "nauc_recall_at_20_diff1": 0.157575, "nauc_recall_at_100_max": -0.157183, "nauc_recall_at_100_std": -0.019437, "nauc_recall_at_100_diff1": 0.013395, # "nauc_recall_at_1000_max": nan, # "nauc_recall_at_1000_std": nan, # "nauc_recall_at_1000_diff1": nan, "nauc_precision_at_1_max": 0.33969, "nauc_precision_at_1_std": -0.202864, "nauc_precision_at_1_diff1": -0.127, "nauc_precision_at_3_max": 0.406318, "nauc_precision_at_3_std": 0.007031, "nauc_precision_at_3_diff1": -0.034709, "nauc_precision_at_5_max": 0.178131, "nauc_precision_at_5_std": -0.112493, "nauc_precision_at_5_diff1": -0.045535, "nauc_precision_at_10_max": 0.167897, "nauc_precision_at_10_std": -0.150626, 
"nauc_precision_at_10_diff1": -0.027811, "nauc_precision_at_20_max": 0.081428, "nauc_precision_at_20_std": -0.042304, "nauc_precision_at_20_diff1": 0.17278, "nauc_precision_at_100_max": -0.150619, "nauc_precision_at_100_std": 0.016133, "nauc_precision_at_100_diff1": -0.065571, "nauc_precision_at_1000_max": -0.017244, "nauc_precision_at_1000_std": 0.046614, "nauc_precision_at_1000_diff1": -0.028258, "nauc_mrr_at_1_max": 0.33969, "nauc_mrr_at_1_std": -0.202864, "nauc_mrr_at_1_diff1": -0.127, "nauc_mrr_at_3_max": 0.409511, "nauc_mrr_at_3_std": -0.064671, "nauc_mrr_at_3_diff1": -0.01911, "nauc_mrr_at_5_max": 0.319584, "nauc_mrr_at_5_std": -0.103546, "nauc_mrr_at_5_diff1": -0.025109, "nauc_mrr_at_10_max": 0.309614, "nauc_mrr_at_10_std": -0.117564, "nauc_mrr_at_10_diff1": -0.019691, "nauc_mrr_at_20_max": 0.262976, "nauc_mrr_at_20_std": -0.092222, "nauc_mrr_at_20_diff1": 0.024507, "nauc_mrr_at_100_max": 0.256052, "nauc_mrr_at_100_std": -0.094249, "nauc_mrr_at_100_diff1": 0.012432, "nauc_mrr_at_1000_max": 0.260112, "nauc_mrr_at_1000_std": -0.098845, "nauc_mrr_at_1000_diff1": 0.009697, "main_score": 0.02837, "hf_subset": "default", "languages": ["dan-Latn"], } ] } # check with_fix == before_fix # True * restructure * format * relax pytrec versions * fix incorrect parsing * 1.38.44 Automatically generated by python-semantic-release * Correcting the JINA models with SentenceTransformerWrapper (#3071) * ci: Add stale workflow (#3066) * add stale workflow * add permissions * add bug label to bug issue template * revert bug issue and only look at more info needed issues * more accurate name * override default * fix: open_clip package validation (#3073) * 1.38.45 Automatically generated by python-semantic-release * fix: Update revision for qzhou models (#3069) * 1.38.46 Automatically generated by python-semantic-release * Fix the reference link for CoDi-Embedding-V1 (#3075) Fix reference link * fix: Add beta version of RTEB related benchmarks (#3048) * Add RTEB related benchmarks * Add RTEB related benchmarks * Correcting the task names in the RTEB benchmarks * Update mteb/leaderboard/benchmark_selector.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Adding the CURE dataset to RTEB benchmarks * Use the right language subset * Fix broken finance icon URL in RTEB benchmarks Replace broken libre-finance-dollar.svg with working libre-gui-price-tag.svg Validated all icon URLs and confirmed accessibility compliance * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY * Add the rteb_benchmarks to the BENCHMARK_REGISTRY --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * 1.38.47 Automatically generated by python-semantic-release * fix: run `ruff check` on all files during ci (#3086) * fix: run `ruff check` on all files during ci * format * 1.38.48 Automatically generated by python-semantic-release * Move dev to dependency groups (#3088) add dependency groups * fix: Improving validate_task_to_prompt_name logs and error messages (#3079) * Improving validate_task_to_prompt_name logs and error messages * linter fixes * Adding None prompts tests * Update test_benchmark_sentence_transformer * Update mteb/leaderboard/benchmark_selector.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> --------- Co-authored-by: Roman 
Solomatin <samoed.roman@gmail.com> * fix: duplicate mteb multilingual variables (#3080) * fix benchmark naming * format * lint * Update tasks & benchmarks tables * model: mdbr-leaf models (#3081) * added MDBR leaf models * fixed revision for mdbr-leaf-ir * added model prompts * updated training datasets * fixed linting * lotte task reference --------- Co-authored-by: Robin Vujanic <robin.vujanic@mongodb.com> * 1.38.49 Automatically generated by python-semantic-release * CI: Set upper limit for xdist version (#3098) * Commentout bibtex formatting * Remove `-n auto` * get back bibtex * try limiting versions * revert coverage * revert coverage --------- Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * Combine Plots and Tables into a Single (#3047) * feat - Combine Plots and Tables into a Single Tab #3009 * feat - Resize the plot to make it more readable * feat - Remove the (radar chart) * feat - Add a comment stating that it only shows the Top 5 models in the table. * feat - adjust layout * Update mteb/leaderboard/app.py * format --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com> Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> * fix: Updating the default batch size calculation in the voyage models (#3091) * 1.38.50 Automatically generated by python-semantic-release * fix: Add @classmethod for @field_validators in TaskMetadata (#3100) * Align task prompt dict with `PromptType` (#3101) * align task prompt dict with `PromptType` * use value instead of enum * 1.38.51 Automatically generated by python-semantic-release * model: Add ModelMeta for OrdalieTech/Solon-embeddings-mini-beta-1.1 (#3090) * Add ModelMeta for OrdalieTech/Solon-embeddings-mini-beta-1.1 * Add training_datasets (common_corpus, fineweb, wiki_fr, private LLM-synth) * Format with ruff + add loader per review * Apply ruff format/fixes * Update mteb/models/ordalietech_solon_embeddings_mini_beta_1_1.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Update mteb/models/ordalietech_solon_embeddings_mini_beta_1_1.py Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> * Register OrdalieTech/Solon-embeddings-mini-beta-1.1 in overview (ModelMeta + loader) * Update mteb/models/ordalietech_solon_embeddings_mini_beta_1_1.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * fix import * Add memory_usage_mb=808.0 and required fields to ModelMeta * Fix 210 milions of parameters --------- Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> Co-authored-by: Isaac Chung <chungisaac1217@gmail.com> Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * fix: Allow closed datasets (#3059) * - Added an include_private parameter to the get_tasks() function that defaults to False - This ensures that by default, tests only run on public datasets - Tests can explicitly set include_private=True when needed to test private datasets - Added is_public: bool | None = None field to TaskMetadata - The field is optional and defaults to None (treated as public) - Updated the is_filled() method to exclude is_public from required fields - Added documentation * - Added an include_private parameter to the get_tasks() function that defaults to False - This ensures that by default, tests only run on public datasets - Tests can explicitly set include_private=True when needed to test private datasets - Added is_public: bool | None = None field to TaskMetadata - The field is optional and defaults to None (treated as public) - Updated the is_filled() 
method to exclude is_public from required fields - Added documentation * Correcting due to comments * Update mteb/abstasks/TaskMetadata.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Update mteb/overview.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Removing the not used filter_tasks_by_privacy function * Correcting due to comments * Correcting due to comments * Correcting due to comments * Removing the test case * Rename the include_private parameter to exclude_private * Rename the include_private parameter to exclude_private * Add private tasks tests * Add private tasks tests * Update tests/test_tasks/test_private_tasks.py Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * Add private tasks tests * Add private tasks tests * Add private tasks tests --------- Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> * 1.38.52 Automatically generated by python-semantic-release --------- Signed-off-by: admin <bo.wang@jina.ai> Co-authored-by: Mohammad Kalim Akram <kalim.akram@jina.ai> Co-authored-by: ItsukiFujii <42373615+ItsukiFujii@users.noreply.github.com> Co-authored-by: xinshuohu <xinshuohu@tencent.com> Co-authored-by: Xinshuo Hu <yanshek.woo@gmail.com> Co-authored-by: fzowl <160063452+fzowl@users.noreply.github.com> Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> Co-authored-by: Paul Teiletche <73120933+paultltc@users.noreply.github.com> Co-authored-by: github-actions <github-actions@github.com> Co-authored-by: Alexey Vatolin <vatolinalex@gmail.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: lsz05 <shengzhe.li@sbintuitions.co.jp> Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> Co-authored-by: zhichao-aws <zhichaog@amazon.com> Co-authored-by: Abdur-Rahman Butler <79828536+abdurrahmanbutler@users.noreply.github.com> Co-authored-by: Feiyang <feiyangc@google.com> Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com> Co-authored-by: semantic-release <semantic-release> Co-authored-by: Nikolay Banar <nikc20008@gmail.com> Co-authored-by: Penny Yu <51702222+PennyYu123@users.noreply.github.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: fzoll <5575946+fzoll@users.noreply.github.com> Co-authored-by: fzowl <zoltan@voyageai.com> Co-authored-by: Bao Loc Pham <67360122+BaoLocPham@users.noreply.github.com> Co-authored-by: Kritias <50093609+ElPlaguister@users.noreply.github.com> Co-authored-by: roipony <roipony@gmail.com> Co-authored-by: Aashka Trivedi <aashka.trivedi@gmail.com> Co-authored-by: Saba Sturua <45267439+jupyterjazz@users.noreply.github.com> Co-authored-by: admin <bo.wang@jina.ai> Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> Co-authored-by: Maximilian Werk <maximilian.werk@gmx.de> Co-authored-by: Victor <zbwkeepgoing@126.com> Co-authored-by: Yong woo Song <ywsong.dev@kakao.com> Co-authored-by: Ryan Mullins <ryan@ryanmullins.org> Co-authored-by: Robin Vujanic <robin-vjc@users.noreply.github.com> Co-authored-by: Robin Vujanic <robin.vujanic@mongodb.com> Co-authored-by: 笑尿伊人 <44760272+q275343119@users.noreply.github.com> Co-authored-by: mathlesage <134429083+mathlesage@users.noreply.github.com>
* model: add image support for jina embeddings v4 (#2893)
* feat: unify text and image embeddings for all tasks
* fix: uniform batch size
* fix: update error message
* fix: update code task
* fix: update max length
* fix: apply review suggestions
* model: add kalm_models (kalm-emb-v2) ModelMeta (new PR) (#2889)
* feat: add KaLM_Embedding_X_0605 in kalm_models
* Update kalm_models.py for lint format
* kalm-emb-v2
* kalm-emb-v2
* kalm-emb-v2
* kalm-emb-v2
* kalm-emb-v2
---------
Co-authored-by: xinshuohu <xinshuohu@tencent.com>
Co-authored-by: Xinshuo Hu <yanshek.woo@gmail.com>
* Add Classification Evaluator unit test (#2838)
* Adding Classification Evaluator test
* Modifications due to the comments
* Update tests/test_evaluators/test_ClassificationEvaluator.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update tests/test_evaluators/test_ClassificationEvaluator.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Modifications due to the comments
* Modifications due to the comments
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
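The flow such an evaluator test exercises is simple: embed the train split with a mock encoder, fit a lightweight classifier, and score the test split. A minimal sketch of that idea (not the actual mteb evaluator API; the encoder, sentences, and assertion are invented for illustration):

```py
import numpy as np
from sklearn.linear_model import LogisticRegression


class ToyEncoder:
    """Stand-in for an embedding model: puts the label signal in dimension 0."""

    def encode(self, sentences, **kwargs):
        return np.array([[1.0 if "good" in s else -1.0, 0.5] for s in sentences])


def test_classification_evaluator_like_flow():
    train_x = ["good movie", "good plot", "bad movie", "bad plot"]
    train_y = [1, 1, 0, 0]
    test_x = ["really good", "really bad"]
    test_y = [1, 0]

    enc = ToyEncoder()
    clf = LogisticRegression().fit(enc.encode(train_x), train_y)
    accuracy = clf.score(enc.encode(test_x), test_y)
    assert accuracy == 1.0
```

A deterministic mock encoder keeps the assertion stable across runs, which is the main point of such a unit test.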
* fix: update colpali engine models (#2905)
* adding vidore benchmarks
* fix typo
* clean vidore names + per lang eval
* lint
* vidore names
* bibtex fix
* fix revision
* vidore v2 citation
* update citation format and fix per-language mappings
* lint: citations
* typo citations
* fix revisions
* lint
* fix colnomic3b revision
* fix colqwen2.5 revision + latest repo version
* fix query augmentation tokens
* colsmol revision
* 1.38.35
Automatically generated by python-semantic-release
* Evaluator tests (#2910)
* Adding Classification Evaluator test
* Modifications due to the comments
* Update tests/test_evaluators/test_ClassificationEvaluator.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update tests/test_evaluators/test_ClassificationEvaluator.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Modifications due to the comments
* Modifications due to the comments
* Adding STSEvaluator and SummarizationEvaluator tests
* Correcting due to the comments
* Correcting due to the comments
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Classification dataset cleaning (#2900)
* Classification dataset cleaning
* Update pull request number
* Fix metadata test
* fix formatting
* add script for cleaning
* Update tasks & benchmarks tables
* dataset: Add JapaneseSentimentClassification (#2913)
Add JapaneseSentimentClassification
* Update tasks & benchmarks tables
* fix: change `passage` prompt to `document` (#2912)
* change document to passage
* fix prompt names
* fix kwargs check
* fix default prompt
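In practice the rename means prompt dictionaries keyed by prompt type now use "document" where they previously used "passage"; a minimal sketch (the prompt strings are illustrative, not the project's defaults):

```py
# Before the change: {"query": "...", "passage": "..."}
model_prompts = {
    "query": "Represent this sentence for searching relevant passages: ",
    "document": "",  # key renamed from "passage"
}
```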
* 1.38.36
Automatically generated by python-semantic-release
* model: Add OpenSearch inf-free sparse encoding models (#2903)
add opensearch inf-free models
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* dataset: add BarExamQA dataset (#2916)
* Add BarExamQA retrieval task
* ran linter
* updated details
* updated details
* fixed subtype name
* fixed changes
* ran linter again
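New retrieval tasks in this repository follow one pattern: subclass AbsTaskRetrieval and fill in TaskMetadata. A rough sketch of what a task like BarExamQA plausibly looks like (the class name, dataset path, revision, dates, and license below are placeholders, not the values from this PR):

```py
from mteb.abstasks.AbsTaskRetrieval import AbsTaskRetrieval
from mteb.abstasks.TaskMetadata import TaskMetadata


class BarExamQA(AbsTaskRetrieval):
    metadata = TaskMetadata(
        name="BarExamQA",
        description="Retrieve the legal passage that supports the answer to a bar-exam question.",
        reference="https://huggingface.co/datasets/placeholder/barexam_qa",  # placeholder
        dataset={
            "path": "placeholder/barexam_qa",  # placeholder, not the real HF path
            "revision": "0000000000000000000000000000000000000000",  # placeholder
        },
        type="Retrieval",
        category="s2p",
        eval_splits=["test"],
        eval_langs=["eng-Latn"],
        main_score="ndcg_at_10",
        date=("2024-01-01", "2024-12-31"),  # placeholder range
        domains=["Legal"],
        task_subtypes=["Question answering"],
        license="cc-by-4.0",  # placeholder
        annotations_creators="derived",
        dialect=[],
        sample_creation="found",
        bibtex_citation="",
    )
```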
* Use `mteb.get_model` in adding_a_dataset.md (#2922)
Update adding_a_dataset.md
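The pattern the updated docs point to is loading models through mteb.get_model rather than instantiating wrappers directly; a minimal, self-contained run (the model and task names are just illustrative choices):

```py
import mteb

# Load a registered model and a task, then run the evaluation.
model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")
```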
* fix: specify revision for opensearch (#2919)
specify revision for opensearch
* 1.38.37
Automatically generated by python-semantic-release
* Update the link for gemini-embedding-001 (#2928)
* fix: replace with passage (#2934)
* fix: Only import SparseEncoder once sentence-transformer version have been checked (#2940)
* fix: Only import SparseEncoder once sentence-transformer version have been checked
fixes #2936
* Update mteb/models/opensearch_neural_sparse_models.py
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
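The idea behind this fix is to defer the `SparseEncoder` import until the installed sentence-transformers version has been checked, so unsupported installs fail with a clear message. A minimal sketch of the pattern, assuming a hypothetical minimum version and helper name (not mteb's exact code):
```python
from packaging.version import Version

import sentence_transformers


def load_sparse_encoder(model_name: str):
    # Hypothetical minimum version; the real requirement is defined inside mteb.
    min_version = "5.0.0"
    if Version(sentence_transformers.__version__) < Version(min_version):
        raise ImportError(
            f"SparseEncoder requires sentence-transformers>={min_version}, "
            f"but {sentence_transformers.__version__} is installed."
        )
    # Import only after the version check has passed, instead of at module import time.
    from sentence_transformers import SparseEncoder

    return SparseEncoder(model_name)
```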
* fix: Prevent incorrectly passing "selector_state" to `get_benchmark` (#2939)
The leaderboard would have (silent) errors where `get_benchmark` led to a KeyError due to "selector_state" being passed as a default value. Setting `DEFAULT_BENCMARK_NAME` as the value solves this issue.
* docs: Update adding_a_dataset.md (#2947)
* docs: Update adding_a_dataset.md
* Update docs/adding_a_dataset.md
* ci: bump semantic release
* 1.38.38
Automatically generated by python-semantic-release
* dataset: Add BSARD v2, fixing the data loading issues of v1 (#2935)
* BSARD loader fixed
* BSARDv2 metadata fixed
* Update mteb/tasks/Retrieval/fra/BSARDRetrieval.py
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update tasks & benchmarks tables
* dataset: add GovReport dataset (#2953)
* Added govreport task
* Updated description
* dataset: add BillSum datasets (#2943)
* Added BillSum datasets
* fixed billsumca
* Updated BillSumCA description
* Updated BillSumUS description
* Update mteb/tasks/Retrieval/eng/BillSumCA.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/tasks/Retrieval/eng/BillSumUS.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* lint
* lint
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* Update tasks & benchmarks tables
* fix: Add new benchmark beRuSciBench along with AbsTaskTextRegression (#2716)
* Add RuSciBench
* fix bitext mining lang
* Add regression task
* fix init
* add missing files
* Improve description
* Add superseded_by
* fix lint
* Update regression task to match with v2
* Add stratified_subsampling for regression task
* Add boostrap for regression task
* Rename task class, add model as evaluator argument
* fix import
* fix import 2
* fixes
* fix
* Rename regression model protocol
* Update tasks & benchmarks tables
* 1.38.39
Automatically generated by python-semantic-release
* qzhou-embedding model_meta & implementation (#2975)
* qzhou-embedding model_meta & implementation
* Update qzhou_models.py
* Update qzhou_models.py
Processing todo items (Add default instruction)
* Update qzhou_models.py
correct bge datalist
* Update qzhou_models.py
correct 'public_training_data'
* Update qzhou_models.py
* Update qzhou_models.py
* Update qzhou_models.py
* Update qzhou_models.py
* Update mteb/models/qzhou_models.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/models/qzhou_models.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* format qzhou_models.py for ruff check
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* model: Add Voyage 3.5 model configuration (#3005)
Add Voyage 3.5 model configuration
- Add voyage_3_5 ModelMeta with 1024 embed dimensions and 32000 max tokens
- Set release date to 2025-01-21 with revision 1
- Configure for cosine similarity with instruction support
- Include standard Voyage training datasets reference
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-authored-by: Claude <noreply@anthropic.com>
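For readers unfamiliar with these registrations: the commit above boils down to a small configuration record. A hedged sketch of the values it describes, written as a plain dict; the real PR defines an mteb `ModelMeta` object, and the key names here only approximate its fields:
```python
# Values taken from the commit message above; the model name string is an assumption.
voyage_3_5_config = {
    "name": "voyage-3.5",
    "revision": "1",
    "release_date": "2025-01-21",
    "embed_dim": 1024,
    "max_tokens": 32000,
    "similarity_fn_name": "cosine",
    "use_instructions": True,
}
```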
* model: BAAI/bge-m3-unsupervised Model (#3007)
* Add BAAI/bge-m3-unsupervised Model
(BAAI/bge_m3_retromae is commented out - the details are correct, but loading the model fails for me, so I commented it out)
* Remove the commented retromae model
---------
Co-authored-by: fzowl <zoltan@voyageai.com>
* lint: Correcting lint errors (#3004)
* Adding Classification Evaluator test
* Modifications due to the comments
* Update tests/test_evaluators/test_ClassificationEvaluator.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update tests/test_evaluators/test_ClassificationEvaluator.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Modifications due to the comments
* Modifications due to the comments
* Correcting the lint errors
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* dataset: Added 50 Vietnamese dataset from vn-mteb (#2964)
* [ADD] 50 vietnamese dataset from vn-mteb
* [UPDATE] task metadata
* [UPDATE] import dependencies
* [UPDATE] task metadata, bibtex citation
* [UPDATE-TEST] test_model_meta
* [UPDATE] sample_creation to machine-translated and LM verified
* [ADD] sample creation machine-translated and LM verified
* [REMOVE] default fields metadata in Classification tasks
* Update tasks & benchmarks tables
* model: Add Cohere embed-v4.0 model support (#3006)
* Add Cohere embed-v4.0 model support
- Add text-only embed-v4.0 model in cohere_models.py
- Add multimodal embed-v4.0 model in cohere_v.py
- Support configurable dimensions (256, 512, 1024, 1536)
- Support 128,000 token context length
- Support multimodal embedding (text, images, mixed PDFs)
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Add Cohere embed-v4.0 model support
Update cohere_v.py and cohere_models.py to include the new embed-v4.0 model with proper configuration and integration.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude <noreply@anthropic.com>
* Add OpenAI models with 512 dimension (#3008)
* Add OpenAI/text-embedding-3-small (512 dim)
Add OpenAI/text-embedding-3-large (512 dim)
* Correcting due to comments
---------
Co-authored-by: fzowl <zoltan@voyageai.com>
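As a usage note, the 512-dimensional variants rely on the OpenAI embeddings API's `dimensions` parameter. A minimal sketch with the standard openai Python client (model name from the commit; the example text is arbitrary):
```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Request the 512-dimensional variant of text-embedding-3-small.
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["An example sentence to embed."],
    dimensions=512,
)
vector = resp.data[0].embedding
print(len(vector))  # 512
```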
* Standardise task names and fix citation formatting (#3026)
fixes for name formatting
* Update tasks & benchmarks tables
* fix: Add missing training sets for qzhou (#3023)
* Supplement missing training sets
* reformat code
* Reorganize the data list format
* update qzhou_model meta
* 1.38.40
Automatically generated by python-semantic-release
* model: Add samilpwc_models meta (#3028)
* model: Add samilpwc_models meta
* Fix: Remove CONST
* Fix: Reformat File
* Update: model revision
* model: Add granite-vision-embedding model (#3029)
* Add files via upload
* Address review comments
* Address review comments
* ruff format
* Update mteb/models/granite_vision_embedding_models.py
* lint error fix
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* fix: incorrect revision for SNLRetrieval (#3033)
The provided revision doesn't seem to be present on:
adrlau/navjordj-SNL_summarization_copy
Replacing with latest revision
* dataset: Add HumanEvalRetrieval task (#3022)
* Add HumanEvalRetrieval dataset
* Fix TaskMetadata structure and remove descriptive_stats
- Use TaskMetadata class instead of dict
- Remove descriptive_stats as requested in PR review
- Add date field and proper import structure
* Fix dataset path and use verified metadata
- Change path from zeroshot/humaneval-embedding-benchmark to embedding-benchmark/HumanEval
- Use actual description from HuggingFace dataset page
- Remove fabricated citation and reference
- Remove revision field that was incorrect
- Reference HuggingFace dataset page instead of arxiv
* Add correct revision hash to HumanEval
- Add revision hash: ed1f48a for reproducibility
* Fix HumanEval metadata validation
- Add date field for metadata completeness
- Add bibtex_citation field (empty string)
- Required for TaskMetadata validation to pass
- Should resolve PR test failure
* Address reviewer feedback
- Remove trust_remote_code parameter as requested
- Add revision parameter to load_dataset() calls for consistency
- Use metadata revision hash in dataset loading for reproducibility
* Fix field names in HumanEval dataset loading
Changed query_id/corpus_id to query-id/corpus-id to match actual dataset format.
* Fix deprecated metadata_dict usage
Use self.metadata.dataset instead of self.metadata_dict for v2.0 compatibility.
* Fix data structure for MTEB compatibility
- Organize data by splits as expected by MTEB retrieval tasks
- Convert scores to integers for pytrec_eval compatibility
* Address PR feedback for HumanEval dataset
- Add descriptive statistics using calculate_metadata_metrics()
- Enhance metadata description with dataset structure details
- Add complete BibTeX citation for original paper
- Update to full commit hash revision
- Add python-Code language tag for programming language
- Explain retrieval task formulation clearly
* Fix BibTeX citation formatting for HumanEvalRetrieval
- Update citation to match bibtexparser formatting requirements
- Fields now in alphabetical order with lowercase names
- Proper trailing commas and indentation
* Update tasks & benchmarks tables
* 1.38.41
Automatically generated by python-semantic-release
* ci: reduce parallel runs for when checking if a dataset exists (#3035)
The hope is that this will prevent many of the current [errors](https://github.com/embeddings-benchmark/mteb/actions/runs/17019125199/job/48245690831)
* ci: Updating rerun delays to prevent false positive errors
* ci: Updating rerun delays to prevent false positive errors
* model: Add GreenNode Vietnamese Embedding models (#2994)
* [ADD] 50 vietnamese dataset from vn-mteb
* [UPDATE] task metadata
* [UPDATE] import dependencies
* [UPDATE] task metadata, bibtex citation
* [UPDATE-TEST] test_model_meta
* [UPDATE] sample_creation to machine-translated and LM verified
* [ADD] sample creation machine-translated and LM verified
* [ADD] Vietnamese Embedding models
* [REMOVE] default fields metadata in Classification tasks
* [UPDATE] model to vi-vn language specific file
* [FIX] lint
* [FIX] model loader
* model: add granite-embedding-english R2 models (#3050)
* fix: Updated revision for jina-embeddings-v4 (#3046)
* fix: jinav4 revision
Signed-off-by: admin <bo.wang@jina.ai>
* change revision instead of removing it
Signed-off-by: admin <bo.wang@jina.ai>
---------
Signed-off-by: admin <bo.wang@jina.ai>
Co-authored-by: admin <bo.wang@jina.ai>
* 1.38.42
Automatically generated by python-semantic-release
* Fix 3 VN-MTEB Pair Classification tasks (#3053)
* [ADD] 50 vietnamese dataset from vn-mteb
* [UPDATE] task metadata
* [UPDATE] import dependencies
* [UPDATE] task metadata, bibtex citation
* [UPDATE-TEST] test_model_meta
* [UPDATE] sample_creation to machine-translated and LM verified
* [ADD] sample creation machine-translated and LM verified
* [ADD] Vietnamese Embedding models
* [REMOVE] default fields metadata in Classification tasks
* [UPDATE] model to vi-vn language specific file
* [FIX] lint
* [FIX] model loader
* [FIX] VN-MTEB 3 datasets PairClassification rename column
* dataset: Add mbpp retrieval (#3037)
* Add MBPP retrieval task
- Code retrieval task based on 378 Python programming problems
- Natural language queries matched to Python code implementations
- Uses python-Code evaluation language for code-specific metrics
- Includes proper citations and descriptive statistics
* Add MBPPRetrieval to imports
* Add descriptive statistics for MBPPRetrieval
* Reformatting
* Reformatting
* Update tasks & benchmarks tables
* dataset: Added wikisql retrieval (#3039)
* Add WikiSQL retrieval task
- Code retrieval task based on WikiSQL natural language to SQL dataset
- Natural language questions matched to SQL query implementations
- Uses sql-Code evaluation language for SQL-specific metrics
- Includes proper citations and descriptive statistics
* Add WikiSQLRetrieval to imports
* Add descriptive statistics for WikiSQLRetrieval
* Reformatting
* Reformatting
* Reformatting, correcting the revision
* Update tasks & benchmarks tables
* ci: Temporarily limit pytrec version to "pytrec-eval-terrier>=0.5.6, <0.5.8" to prevent errors
try to fix CI
* fix MBPPRetrieval revision (#3055)
Update MBPPRetrieval.py
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
* fix: Add VN-MTEB benchmark and Leaderboard (#2995)
* [ADD] 50 vietnamese dataset from vn-mteb
* [UPDATE] task metadata
* [UPDATE] import dependencies
* [UPDATE] task metadata, bibtex citation
* [UPDATE-TEST] test_model_meta
* [UPDATE] sample_creation to machine-translated and LM verified
* [ADD] sample creation machine-translated and LM verified
* [ADD] VN-MTEB benchmark and leaderboard
* [FIX] wrong benchmark name
* [REMOVE] default fields metadata in Classification tasks
* Update tasks & benchmarks tables
* 1.38.43
Automatically generated by python-semantic-release
* Add hc3finance retrieval (#3041)
* Add HC3Finance retrieval task
- Financial retrieval task based on HC3 Finance dataset
- Financial questions matched to human and AI-generated content
- Covers financial explanations, analysis, and educational content
- Includes proper citations and descriptive statistics
* Add HC3FinanceRetrieval to imports
* Add descriptive statistics for HC3FinanceRetrieval
* Reformatting
* Reformatting, correcting the revision
* Update mteb/tasks/Retrieval/eng/HC3FinanceRetrieval.py
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* Add finqa retrieval (#3042)
* Add FinQA retrieval task
- Financial numerical reasoning retrieval task based on FinQA dataset
- Numerical financial questions matched to relevant document data
- Covers earnings reports with tables and quantitative financial data
- Includes proper citations and descriptive statistics
* Add FinQARetrieval to imports
* Add descriptive statistics for FinQARetrieval
* Reformatting
* Reformatting
* Update mteb/tasks/Retrieval/eng/FinQARetrieval.py
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* Update tasks & benchmarks tables
* Add FinanceBenchRetrieval task (#3044)
* Add FinanceBenchRetrieval
* Update mteb/tasks/Retrieval/eng/FinanceBenchRetrieval.py
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* Update tasks & benchmarks tables
* Add FreshStackRetrieval task (#3043)
* Add FreshStackRetrieval
* Reformatting, correcting the revision
* Dataset correction
* Update tasks & benchmarks tables
* dataset: Add ds1000 retrieval (#3038)
* Add DS1000 retrieval task
- Code retrieval task based on 1,000 data science programming problems
- Natural language queries matched to Python data science code
- Uses python-Code evaluation language for code-specific metrics
- Covers pandas, numpy, matplotlib, scikit-learn, and scipy libraries
* Add DS1000Retrieval to imports
* Add descriptive statistics for DS1000Retrieval
* Reformatting
* Reformatting
* Update tasks & benchmarks tables
* Add ChatDoctorRetrieval (#3045)
* Add ChatDoctorRetrieval
* Reformatting, correcting the revision
* Correct the dataset citation
* Correcting due to comments
* Update tasks & benchmarks tables
* Correcting the (new) DS1000 dataset's revision (#3063)
* Add DS1000 retrieval task
- Code retrieval task based on 1,000 data science programming problems
- Natural language queries matched to Python data science code
- Uses python-Code evaluation language for code-specific metrics
- Covers pandas, numpy, matplotlib, scikit-learn, and scipy libraries
* Add DS1000Retrieval to imports
* Add descriptive statistics for DS1000Retrieval
* Reformatting
* Reformatting
* Add DS1000Retrieval task implementation
* dataset: Add JinaVDR (#2942)
* feat: added jinavdr benchmark
* feat: added description for jinavdr
* feat: fixed licenses and added bibtex
* feat: made jinav4 compatible with vidore benchmark
* feat: corrected query numbers
* feat: removed print
* feat: added max pixel argument for jina models
* feat: score calculation on cpu
* feat: adjust jina model for new mteb code
* feat: code cleanup
* feat: corrected bibtex
* feat: make colpali run with jinavdr
* feat: fixed comments
* feat: better reference and fixed comments
* feat: added date for tasks
* feat: fixed missing metadata and bibtex
* feat: added descriptions per dataset
* Update tasks & benchmarks tables
* model: Add CoDi-Embedding-V1 (#3054)
* add codiemb-minicpm
* replace codiemb_minicpm with codi_model
* Update mteb/models/codi_model.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/models/codi_model.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/models/codi_model.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* update code
* update code
* reformat
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* fix: ensure that there are always relevant docs attached to query (#3058)
* fix: ensure that there are always relevant docs attached to query
Here is a brief test showing that it doesn't influence the scores:
```py
import mteb

t1 = mteb.get_task("TwitterHjerneRetrieval")
meta = mteb.get_model_meta("minishlab/potion-base-2M")
evaluation = mteb.MTEB(tasks=[t1])
res = evaluation.run(model=meta.load_model())
# before fix:
res[0].get_score() # np.float64(0.02837)
res[0].scores
before_fix = {
"train": [
{
"ndcg_at_1": 0.02597,
"ndcg_at_3": 0.02213,
"ndcg_at_5": 0.0262,
"ndcg_at_10": 0.02837,
"ndcg_at_20": 0.04548,
"ndcg_at_100": 0.13527,
"ndcg_at_1000": 0.24507,
"map_at_1": 0.00866,
"map_at_3": 0.01317,
"map_at_5": 0.0149,
"map_at_10": 0.01562,
"map_at_20": 0.01898,
"map_at_100": 0.02968,
"map_at_1000": 0.03841,
"recall_at_1": 0.00866,
"recall_at_3": 0.02056,
"recall_at_5": 0.02922,
"recall_at_10": 0.03355,
"recall_at_20": 0.08268,
"recall_at_100": 0.43766,
"recall_at_1000": 1.0,
"precision_at_1": 0.02597,
"precision_at_3": 0.02165,
"precision_at_5": 0.01818,
"precision_at_10": 0.01039,
"precision_at_20": 0.01234,
"precision_at_100": 0.01481,
"precision_at_1000": 0.0034,
"mrr_at_1": 0.025974,
"mrr_at_3": 0.041126,
"mrr_at_5": 0.04632,
"mrr_at_10": 0.048485,
"mrr_at_20": 0.058356,
"mrr_at_100": 0.070186,
"mrr_at_1000": 0.071349,
"nauc_ndcg_at_1_max": 0.33969,
"nauc_ndcg_at_1_std": -0.202864,
"nauc_ndcg_at_1_diff1": -0.127,
"nauc_ndcg_at_3_max": 0.409376,
"nauc_ndcg_at_3_std": -0.039352,
"nauc_ndcg_at_3_diff1": -0.022816,
"nauc_ndcg_at_5_max": 0.250499,
"nauc_ndcg_at_5_std": -0.115263,
"nauc_ndcg_at_5_diff1": -0.057017,
"nauc_ndcg_at_10_max": 0.238696,
"nauc_ndcg_at_10_std": -0.138396,
"nauc_ndcg_at_10_diff1": -0.045287,
"nauc_ndcg_at_20_max": 0.154456,
"nauc_ndcg_at_20_std": -0.070635,
"nauc_ndcg_at_20_diff1": 0.074499,
"nauc_ndcg_at_100_max": -0.005753,
"nauc_ndcg_at_100_std": -0.074738,
"nauc_ndcg_at_100_diff1": -0.005851,
"nauc_ndcg_at_1000_max": 0.109439,
"nauc_ndcg_at_1000_std": -0.089797,
"nauc_ndcg_at_1000_diff1": -0.021634,
"nauc_map_at_1_max": 0.33969,
"nauc_map_at_1_std": -0.202864,
"nauc_map_at_1_diff1": -0.127,
"nauc_map_at_3_max": 0.385244,
"nauc_map_at_3_std": -0.080638,
"nauc_map_at_3_diff1": -0.060991,
"nauc_map_at_5_max": 0.294871,
"nauc_map_at_5_std": -0.119069,
"nauc_map_at_5_diff1": -0.06234,
"nauc_map_at_10_max": 0.285698,
"nauc_map_at_10_std": -0.132856,
"nauc_map_at_10_diff1": -0.055015,
"nauc_map_at_20_max": 0.236619,
"nauc_map_at_20_std": -0.100673,
"nauc_map_at_20_diff1": -0.002619,
"nauc_map_at_100_max": 0.15345,
"nauc_map_at_100_std": -0.138888,
"nauc_map_at_100_diff1": -0.02257,
"nauc_map_at_1000_max": 0.171402,
"nauc_map_at_1000_std": -0.134644,
"nauc_map_at_1000_diff1": -0.034477,
"nauc_recall_at_1_max": 0.33969,
"nauc_recall_at_1_std": -0.202864,
"nauc_recall_at_1_diff1": -0.127,
"nauc_recall_at_3_max": 0.375072,
"nauc_recall_at_3_std": -0.009643,
"nauc_recall_at_3_diff1": -0.089168,
"nauc_recall_at_5_max": 0.147691,
"nauc_recall_at_5_std": -0.128654,
"nauc_recall_at_5_diff1": -0.084259,
"nauc_recall_at_10_max": 0.141055,
"nauc_recall_at_10_std": -0.165932,
"nauc_recall_at_10_diff1": -0.060966,
"nauc_recall_at_20_max": 0.043863,
"nauc_recall_at_20_std": -0.028374,
"nauc_recall_at_20_diff1": 0.157575,
"nauc_recall_at_100_max": -0.157183,
"nauc_recall_at_100_std": -0.019437,
"nauc_recall_at_100_diff1": 0.013395,
# "nauc_recall_at_1000_max": nan,
# "nauc_recall_at_1000_std": nan,
# "nauc_recall_at_1000_diff1": nan,
"nauc_precision_at_1_max": 0.33969,
"nauc_precision_at_1_std": -0.202864,
"nauc_precision_at_1_diff1": -0.127,
"nauc_precision_at_3_max": 0.406318,
"nauc_precision_at_3_std": 0.007031,
"nauc_precision_at_3_diff1": -0.034709,
"nauc_precision_at_5_max": 0.178131,
"nauc_precision_at_5_std": -0.112493,
"nauc_precision_at_5_diff1": -0.045535,
"nauc_precision_at_10_max": 0.167897,
"nauc_precision_at_10_std": -0.150626,
"nauc_precision_at_10_diff1": -0.027811,
"nauc_precision_at_20_max": 0.081428,
"nauc_precision_at_20_std": -0.042304,
"nauc_precision_at_20_diff1": 0.17278,
"nauc_precision_at_100_max": -0.150619,
"nauc_precision_at_100_std": 0.016133,
"nauc_precision_at_100_diff1": -0.065571,
"nauc_precision_at_1000_max": -0.017244,
"nauc_precision_at_1000_std": 0.046614,
"nauc_precision_at_1000_diff1": -0.028258,
"nauc_mrr_at_1_max": 0.33969,
"nauc_mrr_at_1_std": -0.202864,
"nauc_mrr_at_1_diff1": -0.127,
"nauc_mrr_at_3_max": 0.409511,
"nauc_mrr_at_3_std": -0.064671,
"nauc_mrr_at_3_diff1": -0.01911,
"nauc_mrr_at_5_max": 0.319584,
"nauc_mrr_at_5_std": -0.103546,
"nauc_mrr_at_5_diff1": -0.025109,
"nauc_mrr_at_10_max": 0.309614,
"nauc_mrr_at_10_std": -0.117564,
"nauc_mrr_at_10_diff1": -0.019691,
"nauc_mrr_at_20_max": 0.262976,
"nauc_mrr_at_20_std": -0.092222,
"nauc_mrr_at_20_diff1": 0.024507,
"nauc_mrr_at_100_max": 0.256052,
"nauc_mrr_at_100_std": -0.094249,
"nauc_mrr_at_100_diff1": 0.012432,
"nauc_mrr_at_1000_max": 0.260112,
"nauc_mrr_at_1000_std": -0.098845,
"nauc_mrr_at_1000_diff1": 0.009697,
"main_score": 0.02837,
"hf_subset": "default",
"languages": ["dan-Latn"],
}
]
}
# with update:
res[0].get_score() # np.float64(0.02837)
res[0].scores
with_fix = {
"train": [
{
"ndcg_at_1": 0.02597,
"ndcg_at_3": 0.02213,
"ndcg_at_5": 0.0262,
"ndcg_at_10": 0.02837,
"ndcg_at_20": 0.04548,
"ndcg_at_100": 0.13527,
"ndcg_at_1000": 0.24507,
"map_at_1": 0.00866,
"map_at_3": 0.01317,
"map_at_5": 0.0149,
"map_at_10": 0.01562,
"map_at_20": 0.01898,
"map_at_100": 0.02968,
"map_at_1000": 0.03841,
"recall_at_1": 0.00866,
"recall_at_3": 0.02056,
"recall_at_5": 0.02922,
"recall_at_10": 0.03355,
"recall_at_20": 0.08268,
"recall_at_100": 0.43766,
"recall_at_1000": 1.0,
"precision_at_1": 0.02597,
"precision_at_3": 0.02165,
"precision_at_5": 0.01818,
"precision_at_10": 0.01039,
"precision_at_20": 0.01234,
"precision_at_100": 0.01481,
"precision_at_1000": 0.0034,
"mrr_at_1": 0.025974,
"mrr_at_3": 0.041126,
"mrr_at_5": 0.04632,
"mrr_at_10": 0.048485,
"mrr_at_20": 0.058356,
"mrr_at_100": 0.070186,
"mrr_at_1000": 0.071349,
"nauc_ndcg_at_1_max": 0.33969,
"nauc_ndcg_at_1_std": -0.202864,
"nauc_ndcg_at_1_diff1": -0.127,
"nauc_ndcg_at_3_max": 0.409376,
"nauc_ndcg_at_3_std": -0.039352,
"nauc_ndcg_at_3_diff1": -0.022816,
"nauc_ndcg_at_5_max": 0.250499,
"nauc_ndcg_at_5_std": -0.115263,
"nauc_ndcg_at_5_diff1": -0.057017,
"nauc_ndcg_at_10_max": 0.238696,
"nauc_ndcg_at_10_std": -0.138396,
"nauc_ndcg_at_10_diff1": -0.045287,
"nauc_ndcg_at_20_max": 0.154456,
"nauc_ndcg_at_20_std": -0.070635,
"nauc_ndcg_at_20_diff1": 0.074499,
"nauc_ndcg_at_100_max": -0.005753,
"nauc_ndcg_at_100_std": -0.074738,
"nauc_ndcg_at_100_diff1": -0.005851,
"nauc_ndcg_at_1000_max": 0.109439,
"nauc_ndcg_at_1000_std": -0.089797,
"nauc_ndcg_at_1000_diff1": -0.021634,
"nauc_map_at_1_max": 0.33969,
"nauc_map_at_1_std": -0.202864,
"nauc_map_at_1_diff1": -0.127,
"nauc_map_at_3_max": 0.385244,
"nauc_map_at_3_std": -0.080638,
"nauc_map_at_3_diff1": -0.060991,
"nauc_map_at_5_max": 0.294871,
"nauc_map_at_5_std": -0.119069,
"nauc_map_at_5_diff1": -0.06234,
"nauc_map_at_10_max": 0.285698,
"nauc_map_at_10_std": -0.132856,
"nauc_map_at_10_diff1": -0.055015,
"nauc_map_at_20_max": 0.236619,
"nauc_map_at_20_std": -0.100673,
"nauc_map_at_20_diff1": -0.002619,
"nauc_map_at_100_max": 0.15345,
"nauc_map_at_100_std": -0.138888,
"nauc_map_at_100_diff1": -0.02257,
"nauc_map_at_1000_max": 0.171402,
"nauc_map_at_1000_std": -0.134644,
"nauc_map_at_1000_diff1": -0.034477,
"nauc_recall_at_1_max": 0.33969,
"nauc_recall_at_1_std": -0.202864,
"nauc_recall_at_1_diff1": -0.127,
"nauc_recall_at_3_max": 0.375072,
"nauc_recall_at_3_std": -0.009643,
"nauc_recall_at_3_diff1": -0.089168,
"nauc_recall_at_5_max": 0.147691,
"nauc_recall_at_5_std": -0.128654,
"nauc_recall_at_5_diff1": -0.084259,
"nauc_recall_at_10_max": 0.141055,
"nauc_recall_at_10_std": -0.165932,
"nauc_recall_at_10_diff1": -0.060966,
"nauc_recall_at_20_max": 0.043863,
"nauc_recall_at_20_std": -0.028374,
"nauc_recall_at_20_diff1": 0.157575,
"nauc_recall_at_100_max": -0.157183,
"nauc_recall_at_100_std": -0.019437,
"nauc_recall_at_100_diff1": 0.013395,
# "nauc_recall_at_1000_max": nan,
# "nauc_recall_at_1000_std": nan,
# "nauc_recall_at_1000_diff1": nan,
"nauc_precision_at_1_max": 0.33969,
"nauc_precision_at_1_std": -0.202864,
"nauc_precision_at_1_diff1": -0.127,
"nauc_precision_at_3_max": 0.406318,
"nauc_precision_at_3_std": 0.007031,
"nauc_precision_at_3_diff1": -0.034709,
"nauc_precision_at_5_max": 0.178131,
"nauc_precision_at_5_std": -0.112493,
"nauc_precision_at_5_diff1": -0.045535,
"nauc_precision_at_10_max": 0.167897,
"nauc_precision_at_10_std": -0.150626,
"nauc_precision_at_10_diff1": -0.027811,
"nauc_precision_at_20_max": 0.081428,
"nauc_precision_at_20_std": -0.042304,
"nauc_precision_at_20_diff1": 0.17278,
"nauc_precision_at_100_max": -0.150619,
"nauc_precision_at_100_std": 0.016133,
"nauc_precision_at_100_diff1": -0.065571,
"nauc_precision_at_1000_max": -0.017244,
"nauc_precision_at_1000_std": 0.046614,
"nauc_precision_at_1000_diff1": -0.028258,
"nauc_mrr_at_1_max": 0.33969,
"nauc_mrr_at_1_std": -0.202864,
"nauc_mrr_at_1_diff1": -0.127,
"nauc_mrr_at_3_max": 0.409511,
"nauc_mrr_at_3_std": -0.064671,
"nauc_mrr_at_3_diff1": -0.01911,
"nauc_mrr_at_5_max": 0.319584,
"nauc_mrr_at_5_std": -0.103546,
"nauc_mrr_at_5_diff1": -0.025109,
"nauc_mrr_at_10_max": 0.309614,
"nauc_mrr_at_10_std": -0.117564,
"nauc_mrr_at_10_diff1": -0.019691,
"nauc_mrr_at_20_max": 0.262976,
"nauc_mrr_at_20_std": -0.092222,
"nauc_mrr_at_20_diff1": 0.024507,
"nauc_mrr_at_100_max": 0.256052,
"nauc_mrr_at_100_std": -0.094249,
"nauc_mrr_at_100_diff1": 0.012432,
"nauc_mrr_at_1000_max": 0.260112,
"nauc_mrr_at_1000_std": -0.098845,
"nauc_mrr_at_1000_diff1": 0.009697,
"main_score": 0.02837,
"hf_subset": "default",
"languages": ["dan-Latn"],
}
]
}
# check
with_fix == before_fix  # True
```
* restructure
* format
* relax pytrec versions
* fix incorrect parsing
* 1.38.44
Automatically generated by python-semantic-release
* Correcting the JINA models with SentenceTransformerWrapper (#3071)
* ci: Add stale workflow (#3066)
* add stale workflow
* add permissions
* add bug label to bug issue template
* revert bug issue and only look at more info needed issues
* more accurate name
* override default
* fix: open_clip package validation (#3073)
* 1.38.45
Automatically generated by python-semantic-release
* fix: Update revision for qzhou models (#3069)
* 1.38.46
Automatically generated by python-semantic-release
* Fix the reference link for CoDi-Embedding-V1 (#3075)
Fix reference link
* fix: Add beta version of RTEB related benchmarks (#3048)
* Add RTEB related benchmarks
* Add RTEB related benchmarks
* Correcting the task names in the RTEB benchmarks
* Update mteb/leaderboard/benchmark_selector.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Adding the CURE dataset to RTEB benchmarks
* Use the right language subset
* Fix broken finance icon URL in RTEB benchmarks
Replace broken libre-finance-dollar.svg with working libre-gui-price-tag.svg
Validated all icon URLs and confirmed accessibility compliance
* Add the rteb_benchmarks to the BENCHMARK_REGISTRY
* Add the rteb_benchmarks to the BENCHMARK_REGISTRY
* Add the rteb_benchmarks to the BENCHMARK_REGISTRY
* Add the rteb_benchmarks to the BENCHMARK_REGISTRY
* Add the rteb_benchmarks to the BENCHMARK_REGISTRY
* Add the rteb_benchmarks to the BENCHMARK_REGISTRY
* Add the rteb_benchmarks to the BENCHMARK_REGISTRY
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* 1.38.47
Automatically generated by python-semantic-release
* fix: run `ruff check` on all files during ci (#3086)
* fix: run `ruff check` on all files during ci
* format
* 1.38.48
Automatically generated by python-semantic-release
* Move dev to dependency groups (#3088)
add dependency groups
* fix: Improving validate_task_to_prompt_name logs and error messages (#3079)
* Improving validate_task_to_prompt_name logs and error messages
* linter fixes
* Adding None prompts tests
* Update test_benchmark_sentence_transformer
* Update mteb/leaderboard/benchmark_selector.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* fix: duplicate mteb multilingual variables (#3080)
* fix benchmark naming
* format
* lint
* Update tasks & benchmarks tables
* model: mdbr-leaf models (#3081)
* added MDBR leaf models
* fixed revision for mdbr-leaf-ir
* added model prompts
* updated training datasets
* fixed linting
* lotte task reference
---------
Co-authored-by: Robin Vujanic <robin.vujanic@mongodb.com>
* 1.38.49
Automatically generated by python-semantic-release
* CI: Set upper limit for xdist version (#3098)
* Commentout bibtex formatting
* Remove `-n auto`
* get back bibtex
* try limiting versions
* revert coverage
* revert coverage
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* Combine Plots and Tables into a Single Tab (#3047)
* feat - Combine Plots and Tables into a Single Tab #3009
* feat - Resize the plot to make it more readable
* feat - Remove the (radar chart)
* feat - Add a comment stating that it only shows the Top 5 models in the table.
* feat - adjust layout
* Update mteb/leaderboard/app.py
* format
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
* fix: Updating the default batch size calculation in the voyage models (#3091)
* 1.38.50
Automatically generated by python-semantic-release
* fix: Add @classmethod for @field_validators in TaskMetadata (#3100)
* Align task prompt dict with `PromptType` (#3101)
* align task prompt dict with `PromptType`
* use value instead of enum
* 1.38.51
Automatically generated by python-semantic-release
* model: Add ModelMeta for OrdalieTech/Solon-embeddings-mini-beta-1.1 (#3090)
* Add ModelMeta for OrdalieTech/Solon-embeddings-mini-beta-1.1
* Add training_datasets (common_corpus, fineweb, wiki_fr, private LLM-synth)
* Format with ruff + add loader per review
* Apply ruff format/fixes
* Update mteb/models/ordalietech_solon_embeddings_mini_beta_1_1.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/models/ordalietech_solon_embeddings_mini_beta_1_1.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Register OrdalieTech/Solon-embeddings-mini-beta-1.1 in overview (ModelMeta + loader)
* Update mteb/models/ordalietech_solon_embeddings_mini_beta_1_1.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* fix import
* Add memory_usage_mb=808.0 and required fields to ModelMeta
* Fix parameter count (210 million)
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* fix: Allow closed datasets (#3059)
* - Added an include_private parameter to the get_tasks() function that defaults to False
- This ensures that by default, tests only run on public datasets
- Tests can explicitly set include_private=True when needed to test private datasets
- Added is_public: bool | None = None field to TaskMetadata
- The field is optional and defaults to None (treated as public)
- Updated the is_filled() method to exclude is_public from required fields
- Added documentation
* - Added an include_private parameter to the get_tasks() function that defaults to False
- This ensures that by default, tests only run on public datasets
- Tests can explicitly set include_private=True when needed to test private datasets
- Added is_public: bool | None = None field to TaskMetadata
- The field is optional and defaults to None (treated as public)
- Updated the is_filled() method to exclude is_public from required fields
- Added documentation
* Correcting due to comments
* Update mteb/abstasks/TaskMetadata.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/overview.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Removing the not used filter_tasks_by_privacy function
* Correcting due to comments
* Correcting due to comments
* Correcting due to comments
* Removing the test case
* Rename the include_private parameter to exclude_private
* Rename the include_private parameter to exclude_private
* Add private tasks tests
* Add private tasks tests
* Update tests/test_tasks/test_private_tasks.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Add private tasks tests
* Add private tasks tests
* Add private tasks tests
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
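Putting this change together: tasks can be marked non-public via `is_public` in their metadata, and `mteb.get_tasks()` filters them out by default. A hedged usage sketch assuming the final `exclude_private` keyword named in the commits above (its default value here is an assumption):
```python
import mteb

# Default behaviour after this change: private tasks are filtered out.
public_tasks = mteb.get_tasks(languages=["jpn"])

# Explicitly opting in to private tasks; the keyword name comes from the commit
# messages above, but treat the exact signature as an assumption.
all_tasks = mteb.get_tasks(languages=["jpn"], exclude_private=False)
```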
* 1.38.52
Automatically generated by python-semantic-release
* Ci: test out GH models with welcoming new comers (#3112)
test out GH models with welcoming new comers
* ci: Dataset check on new PR (#3103)
* add dataset check on new PR
* add extract datasets
* run as module
* update startswith
* update workflow name
* add GitPython
* export var
* same shell session
* address review comments
* add to docs to say what this script does
* add docs
* model: add Youtu-Embedding-V1 (#3115)
* add youtu models
* add a blank line
* fix the optional dependencies and lint the code
* remove unused dependencies and reformat
* revise prompt_type
---------
Co-authored-by: springxchen <springxchen@tencent.com>
* fix: add voyage quantization models (#3092)
* Adding quantization support
* Update mteb/models/voyage_models.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/model_meta.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/model_meta.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Simplifying the quantization/output_dtype
* Update mteb/model_meta.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* 1.38.53
Automatically generated by python-semantic-release
* model: EmbeddingGemma 300M (#3129)
* model: EmbeddingGemma 300M
* Add license and revision
* fix: Add dedicated display for RTEB benchmark results (#3089)
* feat - remove special filtering, keep zero-shot, keep borda rank
* feat - remove get_rteb_benchmark.py
* feat - delete get_rteb_benchmark.py;RTEB_BENCHMARK_ENTRIES changes
* feat -format
* Update mteb/load_results/benchmark_results.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update tasks & benchmarks tables
* 1.38.54
Automatically generated by python-semantic-release
* dataset: Add Dapfam patent retrieval tasks (#2946)
* chore: add 'Patent retrieval' subtype to TaskMetadata
* feat(retrieval): add DAPFAM patent retrieval tasks (+18 variants)
* Dapfam patent retrieval PR #2946 : refactor DAPFAM tasks (explicit classes, license, metadata, custom definition explanation ...)
* Dapfam patent retrieval PR #2946 : refactor DAPFAM tasks (explicit classes, license, metadata, custom definition explanation ...)
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Changes:
- Added the possibility to opt in or out of quantization through the "quantize" argument.
- Added the possibility to compute the raw dot product without normalization (to reproduce the paper method, the "similarity" argument should be "cosine").
- Removed an unnecessary function and overhauled the task descriptions to be clearer.
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Changes made:
- Overhauled task descriptions as well as naming to conform with the naming scheme of mteb retrieval tasks.
- Similarity is now computed using the similarity function of the passed model.
- Changed the optional quantization method to conform with the sentence-transformers similarity function.
To reproduce the paper metrics, one can use the following snippet:
```python
import mteb
from sentence_transformers import SentenceTransformer

model_name = "Snowflake/snowflake-arctic-embed-m-v2.0"
model = (
    SentenceTransformer(
        model_name,
        model_kwargs={"torch_dtype": "float16"},
        trust_remote_code=True,
    )
    .cuda()
    .eval()
)
tasks = mteb.get_tasks(tasks=[
    "DAPFAMInTitlAbsToTitlAbsClmRetrieval",
    "DAPFAMAllTitlAbsToTitlAbsClmRetrieval",
    "DAPFAMOutTitlAbsToTitlAbsClmRetrieval",
    # add the other 3 remaining tasks ...
])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(
    model,
    output_folder=f"mteb_res/{model_name}",
    quantize=True,  # if set to false or not set, the obtained ndcg@10 and map@10 will be ~0.001 higher
    encode_kwargs={"batch_size": 32},
)
```
* changed default value of quantization to false
* added the import to all DAPFAM tasks; tested that it works; verified compliance with the checklist
* Update mteb/tasks/Retrieval/eng/DAPFAMPatentRetrieval.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* added revision numbers to all dataset loading operations as well as the metadata itself
* intermediate changes, refresh local branch
* intermediate changes, refresh local branch again
* scale back to standard evaluation with empty set exclusion, various cosmetic/formatting changes
* minor cosmetic/formatting changes
* fixed main metric to be ndcg_at_100 as in the paper
* removed old code artifacts from previous versions
* read appropriate loading arguments from task metadata, remove unnecessary class attribute
* reformat bibtex (remark on the assertion: it tries to match a literal string instead of bibtex formatting, and the format is inconsistent with the arXiv default), fixed metadata, parameters read from task metadata, all tests passed
* refactor data loading to read from metadata class attributes
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update tasks & benchmarks tables
* Align max tokens (#3172)
* Correct the VoyageAI model's batch creation/batch size calculation (#3185)
Correct the batch creation
* dataset: Adding JapaneseCode1Retrieval as the first non-public dataset (#3168)
* Adding JapaneseCode1Retrieval as the first non-public dataset
* Transformed dataset
* Adding as private dataset to tests
* Correct the private task test
* Use the sample dataset as a reference
* Use the sample dataset as a reference
* fix ds loading
* allow on forks
* upd aciton
* remove paths
* try to trigger ci
* add ref
* add permissions
* remove paths
* add paths back
* get back to pull request
* rollback action
* Trying to resolve the token/secret problem
* Trying to resolve the token/secret problem
* Update dataset_loading_pr.yml
* Update dataset_loading_pr.yml
* Try the latest datasets package (worked for me)
* Try the latest datasets package (worked for me)
* Try the latest datasets package (worked for me)
* (last?) try
* (last?) try
* (last?) try
* Reverting the changes
* Exclude the private datasets from tests
* Apply suggestions from code review
---------
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Solomatin Roman <samoed.roman@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* fix: add version check for `embeddinggemma-300m` (#3189)
add version check
* dataset: Added a set of closed datasets (#3186)
* Add 12 more closed datasets
Extend the RTEB benchmarks
* trust_remote_code
* trust_remote_code
* Enabling JapaneseCode1Retrieval in the RTEB benchmarks
* Add closed datasets as private tasks
* Correct due to the comment
* Update tasks & benchmarks tables
* fix: Edit ack & sponsors (#3187)
* dataset: Update FaMTEB to Version 2 (#3157)
* Update benchmark to version 2
* make others in benchmark selector one line code
* small changes
* update a few tasks metadata
* update faintent license with correct form
* remove redundant trust remote codes
* fix hardnegatives revision
* make lint
* fix errors
* apply suggestions
* fix citation problem
* add PR link to benchmark desc
* remove duplicate dataset names in mcinext_models
* update prompts
---------
Co-authored-by: mehran <mehan.sarmadi16@gmail.com>
* Update tasks & benchmarks tables
* 1.38.55
Automatically generated by python-semantic-release
* fix: Add conflicting dependencies to toml (#3191)
fix conflict dependencies
* 1.38.56
Automatically generated by python-semantic-release
* fix: Correct metadata for ArguAna dataset (#3202)
* Update tasks & benchmarks tables
* 1.38.57
Automatically generated by python-semantic-release
* model: Add BMRetriever (#3195)
* model: Add BMRetriever
* Update mteb/models/bmretriever_models.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/models/bmretriever_models.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* fix: remove trust_remote_code option
* feat: implement BMREtrieverWrapper based on InstructSentenceTransformerWrapper
* refactor: update training datasets for bmretriever
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Revert "Ci: test out GH models with welcoming new comers" (#3206)
Revert "Ci: test out GH models with welcoming new comers (#3112)"
This reverts commit 73a35e0bb02e61108d50385f4c43fd7d1b16e984.
* model: Add Codefuse models (#3205)
* add codefuse models
* add codefuse models
* Update codefuse_models.py
* lint codefuse.py
* fix(models): ensure prompt_type is passed to format_instruction (#3216)
* 1.38.58
Automatically generated by python-semantic-release
* Adding Cohere's output_dimension and embedding_type parameter (#3204)
* Adding Cohere's output_dimension and embedding_type parameter
Cohere's embed-v4 binary and int8
* Correcting due to comments
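For context, these parameters map onto the Cohere embed call. A hedged sketch of requesting int8 embeddings at a reduced dimension from embed-v4.0; the parameter names follow the commit message and my understanding of the Cohere Python client, so treat the exact signature as an assumption:
```python
import cohere

co = cohere.ClientV2()  # assumes CO_API_KEY is set in the environment

resp = co.embed(
    model="embed-v4.0",
    texts=["what is the revenue guidance for next quarter?"],
    input_type="search_query",
    embedding_types=["int8"],   # or "binary"
    output_dimension=512,       # assumed keyword, per the commit message
)
```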
* dataset: add swedish cpc patent classifications to mteb (#3072)
* feat: add swedish cpc patent classifications to mteb
* fix: formatting and init imports
* fix: update mteb task according to feedback
* fix: perform citation and code formatting
* fix: add train and test split for both datasets
* fix: AttributeError in ColPaliEngineWrapper similarity method (#3177)
* fix: delete kwargs for similarity score in ColPaliEngineWrapper for method behavior
* chore: fix colpali_models similarity handle device
* Update tasks & benchmarks tables
* 1.38.59
Automatically generated by python-semantic-release
* fix: prevent EOS token truncation (#3218)
* fix(models): prevent EOS token truncation for BMRetriever
* refactor(models): refactor tokenizer setup in `InstructSentenceTransformerWrapper`
* fix(models): correct eos token handling in `BMRetrieverWrapper`
* 1.38.60
Automatically generated by python-semantic-release
* Update giga embeddings (#3210)
* update giga embeddings
* update giga embeddings
* 3b-september-2025
* fixed
* lint
* Update mteb/models/ru_sentence_models.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* change revision due to flash-attn dependency
* change apply_instruction_to_passages
---------
Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Неизвестный Пользователь722497 <dolegosmirnov@sberbank.ru>
* fix: Refactor split create_tables into static Benchmark methods (#3126)
* feat - Split create_tables into static Benchmark methods
* feat - format
* Update mteb/leaderboard/table.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* feat - remove search query;take benchmark result as input;addressing the circular import,
* feat - format
* Update mteb/benchmarks/benchmark.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/benchmarks/benchmark.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* feat - use to_dataframe;clean table.py;move creat_table
* feat - fix circular import
* feat - clean-up
* feat - format
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* 1.38.61
Automatically generated by python-semantic-release
* Extending the RTEB benchmark (#3223)
Adding another voyageai model
* Update tasks & benchmarks tables
* model: New qzmodel (#3211)
* Update qzhou_models.py
* Update qzhou_models.py
* reformat script code
* Update configuration
* According to our new decision, the model name has been changed to "QZhou-Embedding-Zh".
* Fix variable naming issues.
* model: Update Youtu embedding model (#3227)
* add youtu models
* add a blank line
* fix the optional dependencies and lint the code
* remove unused dependencies and reformat
* revise prompt_type
* update youtu_models
---------
Co-authored-by: springxchen <springxchen@tencent.com>
* dataset: Add Software Issue Localization Datasets (#3178)
* add software issue localization datasets
* add software issue localization datasets
* update and add multilingual datasets
* fix citation format issues
* Update mteb/tasks/Reranking/eng/SWEbenchVerifiedReranking.py
* fix linting issues
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update tasks & benchmarks tables
* feat: Officially include RTEB in the leaderboard (#3222)
* feat - adjust Rteb's Benchmark
* feat - add blank
* fix menu names
* Update mteb/leaderboard/benchmark_selector.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* moving around tasks
* fix: Update RTEB summary columns (#3226)
* fix(models): ensure prompt_type is passed to format_instruction (#3216)
* 1.38.58
Automatically generated by python-semantic-release
* Adding Cohere's output_dimension and embedding_type parameter (#3204)
* Adding Cohere's output_dimension and embedding_type parameter
Cohere's embed-v4 binary and int8
* Correcting due to comments
* dataset: add swedish cpc patent classifications to mteb (#3072)
* feat: add swedish cpc patent classifications to mteb
* fix: formatting and init imports
* fix: update mteb task according to feedback
* fix: perform citation and code formatting
* fix: add train and test split for both datasets
* fix: AttributeError in ColPaliEngineWrapper similarity method (#3177)
* fix: delete kwargs for similarity score in ColPaliEngineWrapper for method behavior
* chore: fix colpali_models similarity handle device
* Update tasks & benchmarks tables
* 1.38.59
Automatically generated by python-semantic-release
* fix: prevent EOS token truncation (#3218)
* fix(models): prevent EOS token truncation for BMRetriever
* refactor(models): refactor tokenizer setup in `InstructSentenceTransformerWrapper`
* fix(models): correct eos token handling in `BMRetrieverWrapper`
* 1.38.60
Automatically generated by python-semantic-release
* Update giga embeddings (#3210)
* update giga embeddings
* update giga embeddings
* 3b-september-2025
* fixed
* lint
* Update mteb/models/ru_sentence_models.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* change revision due to flash-attn dependency
* change apply_instruction_to_passages
---------
Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Неизвестный Пользователь722497 <dolegosmirnov@sberbank.ru>
* fix: Refactor split create_tables into static Benchmark methods (#3126)
* feat - Split create_tables into static Benchmark methods
* feat - format
* Update mteb/leaderboard/table.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* feat - remove search query;take benchmark result as input;addressing the circular import,
* feat - format
* Update mteb/benchmarks/benchmark.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update mteb/benchmarks/benchmark.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* feat - use to_dataframe;clean table.py;move creat_table
* feat - fix circular import
* feat - clean-up
* feat - format
---------
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* 1.38.61
Automatically generated by python-semantic-release
* Extending the RTEB benchmark (#3223)
Adding another voyageai model
* Update tasks & benchmarks tables
* feat - filter_by_privacy
* feat - add new fields for rteb part
* feat - getattr
* feat - adjust privacy filter logic
* feat - enhance summary table column renaming and add 'is_public' field mapping
* fix: remove unused 'is_public' attribute from TaskResult
---------
Co-authored-by: Yongbin Choi <whybe.choi@gmail.com>
Co-authored-by: semantic-release <semantic-release>
Co-authored-by: fzoll <5575946+fzoll@users.noreply.github.com>
Co-authored-by: Atheer <atheer2104@protonmail.com>
Co-authored-by: Yong woo Song <ywsong.dev@kakao.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Egor <31567312+ekolodin@users.noreply.github.com>
Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Неизвестный Пользователь722497 <dolegosmirnov@sberbank.ru>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: smile <smile@pinai.io>
Co-authored-by: ethan <smiletoye@gmail.com>
* removed show_rteb args
* avoid defining function where we can just use the metadata
* minor fixes
* minor fixes
* fix: Correct logic for filtering public tasks in ModelResult class (#3230)
Co-authored-by: ethan <smiletoye@gmail.com>
---------
Co-authored-by: q275343119 <275343119@qq.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: 笑尿伊人 <44760272+q275343119@users.noreply.github.com>
Co-authored-by: Yongbin Choi <whybe.choi@gmail.com>
Co-authored-by: fzoll <5575946+fzoll@users.noreply.github.com>
Co-authored-by: Atheer <atheer2104@protonmail.com>
Co-authored-by: Yong woo Song <ywsong.dev@kakao.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Egor <31567312+ekolodin@users.noreply.github.com>
Co-authored-by: Kolodin Egor <eikolodin@sberbank.ru>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
Co-authored-by: Неизвестный Пользователь722497 <dolegosmirnov@sberbank.ru>
Co-authored-by: smile <smile@pinai.io>
Co-authored-by: ethan <smiletoye@gmail.com>
* Update tasks & benchmarks tables
* 1.39.0
Automatically generated by python-semantic-release
* fix: Add submission references for RTEB (#3233)
* fix: Add rteb submission references and improve descriptions.
* Added evaluation request
* added field for tasks
* 1.39.1
Automatically generated by python-semantic-release
* dataset: add human tasks and benchmark (#3214)
* Human Subsets Tasks
* Fixed Multilingual Classification Subset
* linting
* fix citations format
* make lint
* fix tests
* remove human folder
* fix relative imports
* add adapted_from for all human subsets
* fix pydantic errors
* add benchmark object
* make benchmark discoverable
* bibtex test
* Apply suggestion
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* Apply suggestions from code review
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
* rename & reupload
* upd tests
* upd tests again
* add model
* add benchmark to leaderboard
* change branch of leaderboard
* remove branch of load data
* fix model meta path
* make mteb importable
* update repo
* Update mteb/benchmarks/benchmarks/benchmarks.py
* Update mteb/leaderboard/benchmark_selector.py
* Update mteb/load_results/load_results.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
---------
Co-authored-by: Adnan El Assadi <aassadi22@ku.edu.tr>
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: AdnanElAssadi56 <115242814+AdnanElAssadi56@users.noreply.github.com>
* Update tasks & benchmarks tables
* Remove 'HUME(v1)' from leaderboard benchmark (#3236)
* Remove 'HUME(v1)' from leaderboard benchmark
* lint
* docs: Update adding benchmark documentation (#3229)
* update adding_a_benchmark.md documentation
* fix numbers
* fix: Further specified macro-language code for Norwegian (#3228)
* fix: Further specified macro-language code for Norwegian
"nor" is a macro-language code that covers bokmål and nynorsk (both norwegian), but this means that these datasets will be missed if using "nob" or "nno". Specifying it like this should allow this.
* furhter specified macro language "nor"
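In practice, filtering by either Bokmål or Nynorsk should now also surface the affected datasets. A small sketch of the filtering call (language codes from the commit; otherwise standard mteb usage):
```python
import mteb

# After the fix, datasets previously tagged only with the macro-language "nor"
# also declare "nob" (Bokmål) and "nno" (Nynorsk), so both filters find them.
bokmal_tasks = mteb.get_tasks(languages=["nob"])
nynorsk_tasks = mteb.get_tasks(languages=["nno"])
```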
* Update tasks & benchmarks tables
* 1.39.2
Automatically generated by python-semantic-release
* fix max tokens (#3243)
* fix python39 transformers compatibility (#3254)
* fix python39 transformers
* fix
* Aggregate by subset for HUMEv1 (#3255)
aggregate by subset for HUMEv1
* Update tasks & benchmarks tables
* Fix AbsTaskTextRegression task (#3257)
Fix AbsTaskTextRegression
* Added Japanese to Retrieval (#3252)
* feat - add Japanese
* feat - use mteb.get_benchmark
* fix - 3.9 test error
* Revert "fix - 3.9 test error"
This reverts commit 6bfee53cff48304cc22d8248aa275dcc9e385475.
* fix - 3.9 test error
* Update tasks & benchmarks tables
* fix bm25 on small datasets (#3261)
* fix: Move zero-shot percentage calculation to the end of summary (#3231)
* Refactor: Move zero-shot percentage calculation to the end of summary table creation, which only applies to the RTEB table.
* Update RTEB benchmark name from "RTEB(beta)" to "RTEB" for consistency in display.
* feat - RTEB(beta)
* feat - remove Zero-shot
---------
Co-authored-by: ethan <smiletoye@gmail.com>
* model: Add ReasonIR (#3221)
* model: Add ReasonIR
* Update mteb/models/reasonir_model.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* Update mteb/models/reasonir_model.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* update n_parameters of ReasonIR
Co-authored-by: Niklas <n.muennighoff@gmail.com>
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Niklas <n.muennighoff@gmail.com>
* fix: Only pin model name and rank (#3263)
Currently we pin 3 columns, which makes it hard or impossible to view on phones. The 3rd column is also no longer guaranteed, as the RTEB leaderboard does not use the zero-shot column.
* 1.39.3
Automatically generated by python-semantic-release
* fix: resolve flash-attention dependency issue (#3265)
* fix: Only pin model name and rank
Currently we pin 3 columns, which makes it hard or impossible to view on phones. The 3rd column is also no longer guaranteed, as the RTEB leaderboard does not use the zero-shot column.
* fix: resolve flash-attention dependency issue
This has been tested and works.
Resolves the flash-attention dependency issues.
Fixes #3240
* 1.39.4
Automatically generated by python-semantic-release
* fix: Add retry and token counting in Cohere models (#3253)
* Retry and token counting in Cohere models
* Retry and token counting in Cohere models
* Retry and token counting in Cohere models
---------
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com>
* 1.39.5
Automatically generated by python-semantic-release
* Align MIEB leaderboards with paper (#3272)
* sort by mean task type and use pure rank for MIEB LBs
* lint
* rename task type column for readability
* fix: add prompt for MIRACLRetrievalHardNegatives (#3266)
* add prompt for MIRACLRetrievalHardNegatives
* add `MIRACLRetrievalHardNegatives.v2`
* Update mteb/tasks/Retrieval/multilingual/MIRACLRetrieval.py
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* move common metadata to dict
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
* Update tasks & benchmarks tables
* Add Regression task mock (#3271)
* 1.39.6
Automatically generated by python-semantic-release
* fix: Change language for task SlovakMovieReviewSentimentClassification (#3296)
* Update tasks & benchmarks tables
* 1.39.7
Automatically generated by python-semantic-release
* Add english code retriever model (#3302)
* Add en code retriever model
* fix model_name
* Update mteb/models/en_code_retriever.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* correct lint
---------
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* docs: fix typos in `docs/adding_a_benchmark.md` (#3344)
* BREAKING: v2.0.0 (#1433)
* [v2] Merge…
I have run the following models on the tasks (adding the results to the PR). These can be run using the `mteb run -m {model_name} -t {task_name}` command.
All of the datasets are taken from our work in this paper; this is the preprint citation of our dataset.
I'll update the model results after this PR to create a new benchmark for VN-MTEB.